Supplementary Information: Holmes et al. 2020

5. E. coli Sakai etp T2SS operon and in vitro expression at 18degC

This notebook describes linear modelling of in vitro expression for etp genes in E. coli Sakai, and the production of an analogous figure to figure 5 from the manuscript.

In this experiment, GFP reporter activity was used to measure gene expression (separately) from the 5` UTR or either etpC (508 bp) or etpD (211bp), in E. coli Sakai. The bacteria were grown in MOPS medium suopplemented either with glucose or glycerol, and Two biological replicates were used for each of the etpC and etpD constructs.

Expression values were measured as Relative Fluorescence Units (RFU) in late exponential phase, corrected for background from the promoterless reporter plasmid pKC026 measured at the same optical density, as raw values. These raw RFU were normalised for cell density.

Data

The data for analysis are provided in the accompanying file long_form_data.tab.

# Load data
data = read.table("datasets/long_form_data.tab", header=T, sep="\t")

# Convert biological replicates to factors
data$biological_replicate = as.factor(data$biological_replicate)

# Show data in tabular form
data %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width=F)

biological_replicate	gene	medium	OD	raw	normalised
1	etpD	glycerol	0.798	4717.0710	5911.1160
1	etpD	glycerol	0.798	4316.2010	5408.7730
1	etpD	glycerol	0.798	4368.2510	5473.9980
2	etpD	glycerol	0.995	4870.8940	4895.3700
2	etpD	glycerol	0.995	4935.6540	4960.4560
2	etpD	glycerol	0.995	4843.4940	4867.4533
3	etpC	glycerol	0.972	952.5186	979.9574
3	etpC	glycerol	0.972	956.3586	983.9081
3	etpC	glycerol	0.972	815.9486	839.4533
4	etpC	glycerol	1.150	875.8493	761.6081
4	etpC	glycerol	1.150	839.5193	730.0168
4	etpC	glycerol	1.150	806.8293	701.5907
1	etpD	glucose	0.890	1022.6530	1149.0480
1	etpD	glucose	0.890	1013.2230	1138.4530
1	etpD	glucose	0.890	1085.1430	1219.2620
5	etpD	glucose	0.975	1382.6100	1418.0610
5	etpD	glucose	0.975	1459.7300	1497.1580
5	etpD	glucose	0.975	1514.5600	1553.3940
3	etpC	glucose	0.910	252.1696	277.1095
3	etpC	glucose	0.910	234.1596	257.3183
3	etpC	glucose	0.910	251.6196	276.5051
6	etpC	glucose	1.050	321.8280	306.5029
6	etpC	glucose	1.050	360.0180	342.8743
6	etpC	glucose	1.050	339.4480	323.2838

Visualising normalised data

We can visualise the normalised data directly, with superimposed boxplots to estimate the range of data, split by medium and gene:

ggplot(data, aes(x=gene, y=normalised)) +
  geom_jitter(aes(color=biological_replicate, shape=medium), width=0.1, height=0, size=4, alpha=0.6) +
  geom_boxplot(data=data %>% filter(medium=="glycerol"), aes(x=gene), fill=NA, width=0.1) +
  geom_boxplot(data=data %>% filter(medium=="glucose"), aes(x=gene), fill=NA, width=0.1) +
  labs(y="gfp fluorescence (RFU)")

It is clear by sight that the fluorescence in glycerol is greater than that in glucose, for both genes. Similarly, by sight, it is clear that (in the same medium) fluorescence is greater for etpD than for etpC.

Also, while there is a visible effect on measured fluorescence associated with the biological replicate, this effect is small with respect to the difference between media and genes, so as a first approximation we do not account for this batch effect.

Estimating medium and gene effects

We can fit a linear model (LM) to the data, in order to estimate the effective difference in fluorescence between etpC and etpD, and between media. We construct two models, each fitting the normalised fluorescence to a combination of gene and medium explanatory variables.

# Fit GLM with interactions
model1 = lm(normalised ~ gene + medium + gene * medium, data=data)

# Fit GLM without interactions
model2 = lm(normalised ~ gene + medium, data=data)

The first thing we can test is whether there is a (statistically) significant difference between the two models. The models differ only by inclusion of an interaction term: the combined effect of changing the gene and the medium. If the fit to the data is much better when we include this interaction we should prefer the more complex model (model1). Otherwise, we’d prefer the simpler model. The practical interpretation of this is that, if the simpler model (model2) fits the data about as well as the more complex model (model1), we can assume that the effect of the gene on fluorescence is essentially independent of the effect of changing medium. If the more complex model fits significantly better, then we can interpret this as there being a combined effect of the medium and gene, interacting in some way, on the measured fluorescence.

# Compare the models
anova(model2, model1)

## Analysis of Variance Table
## 
## Model 1: normalised ~ gene + medium
## Model 2: normalised ~ gene + medium + gene * medium
##   Res.Df      RSS Df Sum of Sq      F   Pr(>F)    
## 1     21 18337557                                 
## 2     20  1118301  1  17219256 307.95 1.29e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The comparison produces an ANOVA table comparing the two models. The Df result is equal to one, meaning that we have a single additional parameter in the more complex model. The very small P-value (10^-13) indicates that the addition of this parameter has a statistically significant effect on the model fit, and that we should prefer the more complex model.

# Generate a table of the model fit
anova(model1)

## Analysis of Variance Table
## 
## Response: normalised
##             Df   Sum Sq  Mean Sq F value    Pr(>F)    
## gene         1 44587585 44587585  797.42 < 2.2e-16 ***
## medium       1 29825651 29825651  533.41 6.809e-16 ***
## gene:medium  1 17219256 17219256  307.95 1.290e-13 ***
## Residuals   20  1118301    55915                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table for the more complex model has very small P-values for all three parameters: gene, medium and gene:medium interaction. This indicates that all three parameters have a statistically significant effect on the measured fluorescence.

# Show model coefficients
model1

## 
## Call:
## lm(formula = normalised ~ gene + medium + gene * medium, data = data)
## 
## Coefficients:
##             (Intercept)                 geneetpD           mediumglycerol  
##                   297.3                   1032.0                    535.5  
## geneetpD:mediumglycerol  
##                  3388.1

The model coefficients can be interpreted as follows.

(Intercept) can be taken to represent a “baseline” fluorescence. In this case, it is the estimated (mean) fluorescence for the gene etpC in MOPS, supplemented with glucose, and has value 297.3 units. All other coefficients represent changes to this baseline.
geneetpD is the estimated effect of changing the gene to etpD (but remaining in the glucose-supplemented medium). The change of gene (in the same medium) is estimated to increase measured fluorescence by 1032 units.
mediumglycerol is the estimated (mean) effect of changing the supplement for the medium from glucose to glycerol, for etpC. This has the value 535.5 units.
geneetpD:mediumglycerol is the estimated (mean) effect of changing both the supplement medium and the gene at the same time. This has the value 3388.1, which is much larger than the estimated additive effect of changing gene or supplement alone (1032 + 535.5 = 1567.5). This can be interpreted to indicate that there is an enhancing effect of the glycerol supplement on etpD-associated fluorescence, greater than would be expected from a linear combination of the individual effects of changing gene or medium supplement.

Conclusions

There are clear statistically significant effects on measured fluorescence due to:
- choice of gene (etpC vs etpD)
- choice of MOPS supplement (glucose vs glycerol)
Glycerol has a statistically significant enhancing effect on measured GFP fluorescence associated with etpD.