5. E. coli Sakai etp T2SS operon and in vitro expression at 18degC

This notebook describes linear modelling of in vitro expression for etp genes in E. coli Sakai, and the production of an analogous figure to figure 5 from the manuscript.

In this experiment, GFP reporter activity was used to measure gene expression (separately) from the 5` UTR or either etpC (508 bp) or etpD (211bp), in E. coli Sakai. The bacteria were grown in MOPS medium suopplemented either with glucose or glycerol, and Two biological replicates were used for each of the etpC and etpD constructs.

Expression values were measured as Relative Fluorescence Units (RFU) in late exponential phase, corrected for background from the promoterless reporter plasmid pKC026 measured at the same optical density, as raw values. These raw RFU were normalised for cell density.

Data

The data for analysis are provided in the accompanying file long_form_data.tab.

# Load data
data = read.table("datasets/long_form_data.tab", header=T, sep="\t")

# Convert biological replicates to factors
data$biological_replicate = as.factor(data$biological_replicate)

# Show data in tabular form
data %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width=F)
biological_replicate gene medium OD raw normalised
1 etpD glycerol 0.798 4717.0710 5911.1160
1 etpD glycerol 0.798 4316.2010 5408.7730
1 etpD glycerol 0.798 4368.2510 5473.9980
2 etpD glycerol 0.995 4870.8940 4895.3700
2 etpD glycerol 0.995 4935.6540 4960.4560
2 etpD glycerol 0.995 4843.4940 4867.4533
3 etpC glycerol 0.972 952.5186 979.9574
3 etpC glycerol 0.972 956.3586 983.9081
3 etpC glycerol 0.972 815.9486 839.4533
4 etpC glycerol 1.150 875.8493 761.6081
4 etpC glycerol 1.150 839.5193 730.0168
4 etpC glycerol 1.150 806.8293 701.5907
1 etpD glucose 0.890 1022.6530 1149.0480
1 etpD glucose 0.890 1013.2230 1138.4530
1 etpD glucose 0.890 1085.1430 1219.2620
5 etpD glucose 0.975 1382.6100 1418.0610
5 etpD glucose 0.975 1459.7300 1497.1580
5 etpD glucose 0.975 1514.5600 1553.3940
3 etpC glucose 0.910 252.1696 277.1095
3 etpC glucose 0.910 234.1596 257.3183
3 etpC glucose 0.910 251.6196 276.5051
6 etpC glucose 1.050 321.8280 306.5029
6 etpC glucose 1.050 360.0180 342.8743
6 etpC glucose 1.050 339.4480 323.2838

Visualising normalised data

We can visualise the normalised data directly, with superimposed boxplots to estimate the range of data, split by medium and gene:

ggplot(data, aes(x=gene, y=normalised)) +
  geom_jitter(aes(color=biological_replicate, shape=medium), width=0.1, height=0, size=4, alpha=0.6) +
  geom_boxplot(data=data %>% filter(medium=="glycerol"), aes(x=gene), fill=NA, width=0.1) +
  geom_boxplot(data=data %>% filter(medium=="glucose"), aes(x=gene), fill=NA, width=0.1) +
  labs(y="gfp fluorescence (RFU)")

It is clear by sight that the fluorescence in glycerol is greater than that in glucose, for both genes. Similarly, by sight, it is clear that (in the same medium) fluorescence is greater for etpD than for etpC.

Also, while there is a visible effect on measured fluorescence associated with the biological replicate, this effect is small with respect to the difference between media and genes, so as a first approximation we do not account for this batch effect.

Estimating medium and gene effects

We can fit a linear model (LM) to the data, in order to estimate the effective difference in fluorescence between etpC and etpD, and between media. We construct two models, each fitting the normalised fluorescence to a combination of gene and medium explanatory variables.

# Fit GLM with interactions
model1 = lm(normalised ~ gene + medium + gene * medium, data=data)

# Fit GLM without interactions
model2 = lm(normalised ~ gene + medium, data=data)

The first thing we can test is whether there is a (statistically) significant difference between the two models. The models differ only by inclusion of an interaction term: the combined effect of changing the gene and the medium. If the fit to the data is much better when we include this interaction we should prefer the more complex model (model1). Otherwise, we’d prefer the simpler model. The practical interpretation of this is that, if the simpler model (model2) fits the data about as well as the more complex model (model1), we can assume that the effect of the gene on fluorescence is essentially independent of the effect of changing medium. If the more complex model fits significantly better, then we can interpret this as there being a combined effect of the medium and gene, interacting in some way, on the measured fluorescence.

# Compare the models
anova(model2, model1)
## Analysis of Variance Table
## 
## Model 1: normalised ~ gene + medium
## Model 2: normalised ~ gene + medium + gene * medium
##   Res.Df      RSS Df Sum of Sq      F   Pr(>F)    
## 1     21 18337557                                 
## 2     20  1118301  1  17219256 307.95 1.29e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The comparison produces an ANOVA table comparing the two models. The Df result is equal to one, meaning that we have a single additional parameter in the more complex model. The very small P-value (10^-13) indicates that the addition of this parameter has a statistically significant effect on the model fit, and that we should prefer the more complex model.

# Generate a table of the model fit
anova(model1)
## Analysis of Variance Table
## 
## Response: normalised
##             Df   Sum Sq  Mean Sq F value    Pr(>F)    
## gene         1 44587585 44587585  797.42 < 2.2e-16 ***
## medium       1 29825651 29825651  533.41 6.809e-16 ***
## gene:medium  1 17219256 17219256  307.95 1.290e-13 ***
## Residuals   20  1118301    55915                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table for the more complex model has very small P-values for all three parameters: gene, medium and gene:medium interaction. This indicates that all three parameters have a statistically significant effect on the measured fluorescence.

# Show model coefficients
model1
## 
## Call:
## lm(formula = normalised ~ gene + medium + gene * medium, data = data)
## 
## Coefficients:
##             (Intercept)                 geneetpD           mediumglycerol  
##                   297.3                   1032.0                    535.5  
## geneetpD:mediumglycerol  
##                  3388.1

The model coefficients can be interpreted as follows.

Conclusions