06/06/2017

1. Who, me?

My Background

I'm not really a computer scientist…

  • 1992-1996 BSc(Hons) Forensic and Analytical Chemistry
    University of Strathclyde
  • 1995 Product Development Chemist
    Mobil Oil
  • 1996 Body Fluid Analysis
    Glasgow Royal Infirmary
  • 1996-1999 PhD Computational Biology
    University of Strathclyde

  • 1999-2003 PostDoc Systems Biology
    University of Wales, Aberystwyth
  • 2003-2011 Computational Biologist
    Scottish Crop Research Institute
  • 2005-2013 BA(Hons) Mathematics
    Open University
  • 2011-present Computational Biologist
    The James Hutton Institute

Earlier work

Learning on the job…

  • Sequence/structure evolution of snake venom toxins
  • Drug target site discovery algorithm

  • Systems Biology
  • Modelling yeast metabolism
  • Directed evolution

2. The James Hutton Institute

The James Hutton Institute

  • formed from Scottish Crop Research Institute (SCRI) and Macaulay Land Use Research Institute (MLURI) in 2011
  • main sites: Dundee, Aberdeen
  • also: Glensaugh, Balruddery, Hartwood
  • oversees Biomathematics and Biostatistics Scotland (BioSS)
  • University of Dundee Division of Plant Sciences based at Dundee site
  • Scottish Government policy advice

Vision and Mission

Vision

  • "To be at the forefront of innovative and transformative science for sustainable management of land, crop and natural resources that supports thriving communities."

Mission

  • "To conduct excellent science and engage in new ways of working across disciplines, with business, policy and society, that guide contemporary thought and challenge conventional wisdom, ensure trust and deliver the best outcomes for all."

Group Objective

  • "To deliver greater food and environmental security through science connecting land and people"

ICS Hutton

@huttonics

3. Plant-pathogen interactions

A Global Challenge

  • feed 9.2bn people (86% in developing world) by 2050
  • double food production in 50yr
  • food losses to pathogens:
    • 10-25% of planted crops
    • 10% of post-harvest crops
    • (enough to feed >2bn people at 2,000 kcal/day)
  • pathogens moving polewards
  • severe threats to ecosystem services


Challenges in Scotland

  • food security
    • burden, cost of crop disease
    • emerging pathogens (imports, climate change)
  • environmental sustainability
    • pesticide minimisation, withdrawal
    • durable resistance via breeding (and/or GM)
  • £308m cereals
  • £258m other crops (£171m potatoes)
  • £264m horticulture




Key Potato Pathogens

Soft rot enterobacteria

  • Pectobacterium spp., Dickeya spp., Erwinia spp.
  • many species, host ranges
  • plant cell wall degrading enzymes
  • rots plants in field, tubers in storage
  • €30m/yr losses in Netherlands
  • P. atrosepticum main cause of blackleg in Scotland
  • D. solani recently emerged, very aggressive

Potato late blight

  • Phytophthora infestans
  • Global problem
  • estimated $6bn/yr cost worldwide
  • UK costs ≈£55m/yr in losses, ≈£70m/yr in control
  • requires frequent pesticide application (incl. for organics)
  • (related spp. cause extensive damage to trees: larch, oak, juniper)

How do pathogens infect?

Models Guide Thinking

Prevailing interaction model: "Zig-Zag"

  • 1. plant detects pathogen: PAMP-triggered immunity (PTI)
  • 2. pathogen produces effector: effector-triggered susceptibility (ETS)
  • 3. plant detects effector: effector-triggered immunity (ETI)
  • model is evolutionary
  • not a dynamic interaction
  • not a specific timescale
  • not a specific biological scale
  • qualitative only

Dynamic (Toy) Model

4. Genomes I
(Phytophthora and potato)

P. infestans genome

published 2009, focus on pathogen effectors

  • effector complement affects host range, aggression
  • P. sojae/P. ramorum sequences available for comparison
  • two-speed genome (expansion) enhances diversity in effectors

Effector classification

RxLR effectors

Genome-scale predictions

Potato genome

NB-LRR predictions

R-gene enrichment

commercial gene enrichment bead 'array', based on NB-LRR model

  • "physical BLAST search"
  • design 'bait' from model
  • capture gDNA with 'bait'
  • sequence, assemble/map to genome
  • identified 338 additional candidate NB-LRRs
  • engineering durable resistance
  • R-gene 'stacking' to evade adaptable pathogens

5. Genomes II (bacteria)

riboSeed

  • NUI Galway
  • Teagasc
  • Nick Waters
  • Fiona Brennan
  • Florence Abram
  • Ashleigh Holmes
  • part of larger project on environmental E. coli

Short Reads and Repeats

  • most genome sequencing is (for now) short read Illumina
  • reads shorter than repeat sequence \(\implies\) repeats not resolved
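
A minimal Python sketch of why short reads cannot resolve repeats. The two toy genomes and the 10 bp repeat are invented for illustration: they differ only in the order of the unique segments between repeat copies. Error-free reads shorter than the repeat give identical read sets for both genomes, so no assembler could tell them apart; reads that span a whole repeat copy plus a base on each side can.

```python
from collections import Counter


def read_spectrum(genome: str, read_len: int) -> Counter:
    """Multiset of every error-free read of length read_len from genome."""
    return Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )


# Hypothetical 10 bp repeat and unique segments (illustration only)
REPEAT = "ACGTTGCATG"
genome_1 = "TTTTT" + REPEAT + "GGGGG" + REPEAT + "CCCCC" + REPEAT + "AAAAA"
genome_2 = "TTTTT" + REPEAT + "CCCCC" + REPEAT + "GGGGG" + REPEAT + "AAAAA"

# Reads of 8 bp are shorter than the repeat: both genomes yield the same reads
same_with_short_reads = read_spectrum(genome_1, 8) == read_spectrum(genome_2, 8)

# Reads of 12 bp span a full repeat copy plus one flanking base on each side
same_with_long_reads = read_spectrum(genome_1, 12) == read_spectrum(genome_2, 12)

print(same_with_short_reads, same_with_long_reads)
```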

16S Metabarcoding

16S central to microbial ecology (microbiome)

  • universal, essential structural RNA - low variation
  • all 16S metabarcoding methods rely on reference databases
  • bacterial genomes commonly carry several very similar copies of the 16S gene
  • 16S does not assemble well: most published genomes collapse 16S

rDNA differences

rDNA flanking regions differ between clusters within a genome, but are conserved between related genomes

  • identify clusters from related bacteria
  • treat clusters in isolation
  • 'pin' unique reads to the corresponding reference cluster
  • distribute common reads among all clusters

riboSeed Algorithm Sketch

  • reference \(\leftarrow\) closely-related genome
  • iters \(\leftarrow\) 0
  • while iters < N
    • map reads to reference
    • clusters \(\leftarrow\) reference rDNA clusters
    • pseudocontigs \(\leftarrow\) {}
    • for c in clusters
      • extract reads mapping to cluster and flanking region
      • hybrid assemble extracted reads with cluster
      • add assembly to pseudocontigs
    • reference \(\leftarrow\) joined pseudocontigs
    • iters++
  • use pseudocontigs as 'trusted long reads' in hybrid assembly
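
The control flow of the sketch above can be written as a short Python skeleton. This is not riboSeed's code: every helper (`map_reads`, `find_rdna_clusters`, `extract_reads`, `assemble_with_seed`, `join_pseudocontigs`) is a hypothetical callable passed in by the caller, and the toy stand-ins at the bottom only exercise the loop on strings.

```python
from typing import Any, Callable, List


def seed_pseudocontigs(
    reads: Any,
    reference: Any,
    n_iters: int,
    map_reads: Callable,           # hypothetical: (reads, reference) -> mapping
    find_rdna_clusters: Callable,  # hypothetical: reference -> list of clusters
    extract_reads: Callable,       # hypothetical: (mapping, cluster) -> reads
    assemble_with_seed: Callable,  # hypothetical: (reads, cluster) -> contig
    join_pseudocontigs: Callable,  # hypothetical: contigs -> new reference
) -> List[Any]:
    """Iteratively re-seed each rDNA cluster; return the final pseudocontigs."""
    pseudocontigs: List[Any] = []
    for _ in range(n_iters):
        mapping = map_reads(reads, reference)
        clusters = find_rdna_clusters(reference)
        pseudocontigs = []
        for cluster in clusters:
            # Each cluster (plus flanking region) is treated in isolation
            subset = extract_reads(mapping, cluster)
            pseudocontigs.append(assemble_with_seed(subset, cluster))
        # The joined pseudocontigs become the reference for the next pass
        reference = join_pseudocontigs(pseudocontigs)
    return pseudocontigs


# Toy stand-ins: "clusters" are '|'-separated strings, reads are tagged by
# their first character, and "assembly" just appends read suffixes.
seeds = seed_pseudocontigs(
    reads=["a1", "b2"],
    reference="a|b",
    n_iters=2,
    map_reads=lambda reads, ref: reads,
    find_rdna_clusters=lambda ref: ref.split("|"),
    extract_reads=lambda mapping, c: [r for r in mapping if r[0] == c[0]],
    assemble_with_seed=lambda subset, c: c + "".join(r[1:] for r in subset),
    join_pseudocontigs=lambda contigs: "|".join(contigs),
)
print(seeds)
```

The returned pseudocontigs are what the final step would hand to a hybrid assembler as 'trusted long reads'.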

riboSeed Graphical Overview

Simulated Reads

artificial chromosome: spaced E. coli rDNA clusters

  • de novo assembles 0/7 clusters
  • de fere novo assembles 4/7 clusters (E. coli reference)
  • de fere novo assembles 1/7 clusters (Klebsiella reference)

Reference Genome Choice

most rDNA clusters assembled when ref. mutation rate \(\leq\) 0.03

Real Data Performance

hybrid (PacBio/Illumina assembly reference)

  • Illumina-only de novo asm: 1/4; de fere novo asm: 4/4

Closed scaffold of Illumina-only Staphylococcus aureus UAM-1 draft genome

future directions

  • HMM profile for homologous rDNA, rather than reference genome 'bait'
  • assemble SRA/ENA public datasets

riboSeed

6. Diagnostics

Diagnostics & Classification

  • Scottish Government-funded
  • Sonia Humphris
  • Ian Toth
  • Emma Campbell

SRE Global Distribution

SRE Taxonomy

  • enterobacterial taxonomy is difficult to resolve, in general

Historical classification mostly phenotypic, polyphasic

  • all SRE originally Erwinia (1950s - bucket classification)
  • now three distinct genera (Erwinia, Dickeya, Pectobacterium)
  • binomial nomenclature not designed for large amounts of genome data, or metadata curation
  • old names hold over in literature, collections
  • name discontinuities affect analyses, databases

Historical Taxonomy

Legislation is Taxonomy-Based

European and Mediterranean Plant Protection Organisation (EPPO)

  • member states should regulate D. dianthicola and E. amylovora as quarantine pests (A2 list)

Seed Potatoes (Scotland) Amendment Regulations (2010)

  • zero tolerance policy for all Dickeya spp. on potatoes in Scotland, to ensure production of 'clean' (disease-free) seed potatoes for export


consortium for control and epidemiology

Taxonomy For Legislation and Policy

Easy to incorporate into legislation (binary classification)

  • Assumes taxonomy is assigned precisely and correctly
  • Assumes taxonomy is a proxy for risk

Essentially a data structure problem!

  • Historical phenotypic classification is semi-arbitrary (≈binary tree)
  • Is a species concept appropriate for bacteria? (No)
  • Is disease risk only transferred parent to offspring? (No: HGT)
  • Is there 1:1 mapping from genome/phenotype to disease? (No)
  • Are all currently known bacteria correctly classified? (No)

Dickeya diagnostics

  • Having recently sequenced 25 Dickeya genomes, we were asked to develop new diagnostics

To legislate effectively, must discriminate and identify the pathogen

  • MLST/16S/qPCR schemes exist, trained on 'old' classifications
    • limited resolution, considerable natural variation
  • MLST/16S/qPCR are small parts (<5kbp) of a larger genome (≈5Mbp)
    • good choice for one group, maybe not for another
  • qPCR is cheaper than WGS (for now)
  • no qPCR primers existed to distinguish among Dickeya spp.

Whole-Genome Diagnostics Design

Classification is a Problem!

Consequences of Misclassification

Real-world impacts of misclassification

  • False positives (type I errors):
    • clean samples rejected: economic cost
    • farms quarantined/closed: economic/societal cost
  • False negatives (type II errors):
    • (irreversible) introduction of infectious material
    • potential for novel host jumps and spread

≈18% of genomes in public databases misclassified by species

  • accurate classifications essential for diagnostics training
  • reclassification of genomes in public databases necessary

Successful Design

Precise Classification

A New Outbreak Workflow

7. Whole-genome classification

DNA-DNA hybridisation

"Gold standard" whole-genome classification since the 1960s

  • "70% identity" ≈ same species
  • denature DNA from two organisms
  • allow DNA to anneal, measure temperature change
  • replace with whole-genome comparisons?

Average Nucleotide Identity (ANIm)
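
A minimal sketch of the arithmetic behind an ANI-style comparison, assuming the pairwise genome alignment (MUMmer, in the case of ANIm) has already been summarised as non-overlapping aligned regions. The `(aligned_length, matching_bases)` tuples and the function name are hypothetical, and this is not pyani's implementation.

```python
from typing import List, Tuple

# Hypothetical summary of one aligned region: (aligned_length, matching_bases)
Alignment = Tuple[int, int]


def ani_and_coverage(
    alignments: List[Alignment], genome_length: int
) -> Tuple[float, float]:
    """Return (identity over aligned bases, fraction of genome aligned)."""
    aligned = sum(length for length, _ in alignments)
    matches = sum(match for _, match in alignments)
    identity = matches / aligned if aligned else 0.0
    coverage = aligned / genome_length
    return identity, coverage


# Toy numbers: three aligned regions from a 5,000 bp query genome
toy_alignments = [(1000, 980), (2000, 1900), (500, 495)]
identity, coverage = ani_and_coverage(toy_alignments, 5000)
print(f"identity={identity:.3f} coverage={coverage:.2f}")
```

Reporting coverage alongside identity matters: a high identity over a small aligned fraction says little about the genomes as a whole.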

pyani

ANIm: Dickeya

ANIm: Pectobacterium

ANI Criticisms

  • 95% identity threshold is arbitrary
  • not a phylogenetic relationship ('just' sequence similarity)
  • not a functional interpretation of disease risk
  • ANI considers 'homologous' regions (variable definition)
  • is this misled by lateral gene transfer?
  • is homology a good proxy for disease risk?

ANIm: Pectobacterium

ANIm: Dickeya

Whole Genome Classification

ANI Results Define Graphs

ANIm of all sequenced SRE genomes. Edges drawn where alignment coverage > 50%

  • three main groups (genera)

Cliques

cliques (complete subgraphs, where every pair of nodes is joined by an edge) are 'natural' clusterings

  • clique membership varies with ANI %identity
  • clique membership (at given %ID) is permanent and scales

at some %identity values, all graph components are cliques
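
A minimal standard-library sketch of the clique test, assuming pairwise identities are already available as a dictionary. The genome names and identity values are invented. An edge is kept when identity meets the threshold, connected components are found by graph traversal, and a component is a clique when every member is linked to every other member.

```python
from typing import Dict, List, Set, Tuple


def components(
    identity: Dict[Tuple[str, str], float], genomes: List[str], threshold: float
) -> Tuple[List[Set[str]], Dict[str, Set[str]]]:
    """Connected components of the graph with edges at identity >= threshold."""
    adj: Dict[str, Set[str]] = {g: set() for g in genomes}
    for (a, b), value in identity.items():
        if value >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    seen: Set[str] = set()
    comps: List[Set[str]] = []
    for g in genomes:
        if g in seen:
            continue
        comp: Set[str] = set()
        stack = [g]
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps, adj


def all_components_are_cliques(identity, genomes, threshold) -> bool:
    comps, adj = components(identity, genomes, threshold)
    # A component is a clique if each node is adjacent to all other members
    return all(adj[n] >= comp - {n} for comp in comps for n in comp)


# Invented identities for four genomes
genomes = ["A", "B", "C", "D"]
identity = {
    ("A", "B"): 0.97, ("B", "C"): 0.96, ("A", "C"): 0.93,
    ("A", "D"): 0.79, ("B", "D"): 0.80, ("C", "D"): 0.80,
}
for t in (0.90, 0.95, 0.965):
    print(t, all_components_are_cliques(identity, genomes, t))
```

In this toy data the 0.95 threshold leaves A and C in one component without a direct edge, so that component is not a clique, while 0.90 and 0.965 both give clean clusterings.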

Network Deconstruction

Reclassification: Pectobacterium

Reclassification: Dickeya

Reclassification: Erwinia

JHI Collections

JHI holds historical pathogen samples from the 1950s onwards

  • sequence historical samples
  • sequence current/recent outbreak samples
  • environmental variation
  • historical changes (evolution, introduction events)

Sequenced ≈50 P. atrosepticum isolates from infections (2009-2015)

  • sourced from SASA
  • Illumina sequencing
  • prokka, roary, QUAST, parSNP

SNP distribution

SNPs widespread across all P. atrosepticum genomes

parSNP Clades

P. atrosepticum divisible into four clades

  • clades contain isolates found outwith UK/EU (from GenBank)

Distribution

all four clades widespread, no obvious geographical pattern

Pangenomes

15% of P. atrosepticum genes are 'accessory'
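
A minimal sketch of how an accessory fraction can be computed from gene presence/absence, assuming a strict definition of 'core' (present in every isolate). Pangenome tools may use a looser cutoff, and the isolate and gene names below are invented.

```python
from typing import Dict, Set, Tuple


def split_pangenome(presence: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core, accessory) gene sets from per-isolate gene sets."""
    gene_sets = list(presence.values())
    pan = set().union(*gene_sets)           # every gene seen in any isolate
    core = set.intersection(*gene_sets)     # genes present in all isolates
    return core, pan - core


# Invented presence/absence data for three isolates
presence = {
    "isolate_1": {"geneA", "geneB", "geneC", "geneD"},
    "isolate_2": {"geneA", "geneB", "geneC", "geneE"},
    "isolate_3": {"geneA", "geneB", "geneC"},
}
core, accessory = split_pangenome(presence)
accessory_fraction = len(accessory) / (len(core) + len(accessory))
print(sorted(core), sorted(accessory), accessory_fraction)
```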

Accessory genes

accessory gene tree not congruent with the SNP tree

8. Acknowledgements

Without Whom…