This notebook describes linear modelling of in vitro expression for etp genes in E. coli Sakai, and the production of an analogous figure to figure 5 from the manuscript.
In this experiment, GFP reporter activity was used to measure gene expression (separately) from the 5` UTR or either etpC (508 bp) or etpD (211bp), in E. coli Sakai. The bacteria were grown in MOPS medium suopplemented either with glucose or glycerol, and Two biological replicates were used for each of the etpC and etpD constructs.
Expression values were measured as Relative Fluorescence Units (RFU) in late exponential phase, corrected for background from the promoterless reporter plasmid pKC026 measured at the same optical density, as raw
values. These raw
RFU were normalised for cell density.
The data for analysis are provided in the accompanying file long_form_data.tab
.
# Load data
data = read.table("datasets/long_form_data.tab", header=T, sep="\t")
# Convert biological replicates to factors
data$biological_replicate = as.factor(data$biological_replicate)
# Show data in tabular form
data %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width=F)
biological_replicate | gene | medium | OD | raw | normalised |
---|---|---|---|---|---|
1 | etpD | glycerol | 0.798 | 4717.0710 | 5911.1160 |
1 | etpD | glycerol | 0.798 | 4316.2010 | 5408.7730 |
1 | etpD | glycerol | 0.798 | 4368.2510 | 5473.9980 |
2 | etpD | glycerol | 0.995 | 4870.8940 | 4895.3700 |
2 | etpD | glycerol | 0.995 | 4935.6540 | 4960.4560 |
2 | etpD | glycerol | 0.995 | 4843.4940 | 4867.4533 |
3 | etpC | glycerol | 0.972 | 952.5186 | 979.9574 |
3 | etpC | glycerol | 0.972 | 956.3586 | 983.9081 |
3 | etpC | glycerol | 0.972 | 815.9486 | 839.4533 |
4 | etpC | glycerol | 1.150 | 875.8493 | 761.6081 |
4 | etpC | glycerol | 1.150 | 839.5193 | 730.0168 |
4 | etpC | glycerol | 1.150 | 806.8293 | 701.5907 |
1 | etpD | glucose | 0.890 | 1022.6530 | 1149.0480 |
1 | etpD | glucose | 0.890 | 1013.2230 | 1138.4530 |
1 | etpD | glucose | 0.890 | 1085.1430 | 1219.2620 |
5 | etpD | glucose | 0.975 | 1382.6100 | 1418.0610 |
5 | etpD | glucose | 0.975 | 1459.7300 | 1497.1580 |
5 | etpD | glucose | 0.975 | 1514.5600 | 1553.3940 |
3 | etpC | glucose | 0.910 | 252.1696 | 277.1095 |
3 | etpC | glucose | 0.910 | 234.1596 | 257.3183 |
3 | etpC | glucose | 0.910 | 251.6196 | 276.5051 |
6 | etpC | glucose | 1.050 | 321.8280 | 306.5029 |
6 | etpC | glucose | 1.050 | 360.0180 | 342.8743 |
6 | etpC | glucose | 1.050 | 339.4480 | 323.2838 |
We can visualise the normalised data directly, with superimposed boxplots to estimate the range of data, split by medium and gene:
ggplot(data, aes(x=gene, y=normalised)) +
geom_jitter(aes(color=biological_replicate, shape=medium), width=0.1, height=0, size=4, alpha=0.6) +
geom_boxplot(data=data %>% filter(medium=="glycerol"), aes(x=gene), fill=NA, width=0.1) +
geom_boxplot(data=data %>% filter(medium=="glucose"), aes(x=gene), fill=NA, width=0.1) +
labs(y="gfp fluorescence (RFU)")
It is clear by sight that the fluorescence in glycerol is greater than that in glucose, for both genes. Similarly, by sight, it is clear that (in the same medium) fluorescence is greater for etpD than for etpC.
Also, while there is a visible effect on measured fluorescence associated with the biological replicate, this effect is small with respect to the difference between media and genes, so as a first approximation we do not account for this batch effect.
We can fit a linear model (LM) to the data, in order to estimate the effective difference in fluorescence between etpC and etpD, and between media. We construct two models, each fitting the normalised fluorescence to a combination of gene
and medium
explanatory variables.
# Fit GLM with interactions
model1 = lm(normalised ~ gene + medium + gene * medium, data=data)
# Fit GLM without interactions
model2 = lm(normalised ~ gene + medium, data=data)
The first thing we can test is whether there is a (statistically) significant difference between the two models. The models differ only by inclusion of an interaction term: the combined effect of changing the gene and the medium. If the fit to the data is much better when we include this interaction we should prefer the more complex model (model1
). Otherwise, we’d prefer the simpler model. The practical interpretation of this is that, if the simpler model (model2
) fits the data about as well as the more complex model (model1
), we can assume that the effect of the gene on fluorescence is essentially independent of the effect of changing medium. If the more complex model fits significantly better, then we can interpret this as there being a combined effect of the medium and gene, interacting in some way, on the measured fluorescence.
# Compare the models
anova(model2, model1)
## Analysis of Variance Table
##
## Model 1: normalised ~ gene + medium
## Model 2: normalised ~ gene + medium + gene * medium
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 21 18337557
## 2 20 1118301 1 17219256 307.95 1.29e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The comparison produces an ANOVA table comparing the two models. The Df
result is equal to one, meaning that we have a single additional parameter in the more complex model. The very small P-value (10^-13) indicates that the addition of this parameter has a statistically significant effect on the model fit, and that we should prefer the more complex model.
# Generate a table of the model fit
anova(model1)
## Analysis of Variance Table
##
## Response: normalised
## Df Sum Sq Mean Sq F value Pr(>F)
## gene 1 44587585 44587585 797.42 < 2.2e-16 ***
## medium 1 29825651 29825651 533.41 6.809e-16 ***
## gene:medium 1 17219256 17219256 307.95 1.290e-13 ***
## Residuals 20 1118301 55915
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table for the more complex model has very small P-values for all three parameters: gene
, medium
and gene:medium
interaction. This indicates that all three parameters have a statistically significant effect on the measured fluorescence.
# Show model coefficients
model1
##
## Call:
## lm(formula = normalised ~ gene + medium + gene * medium, data = data)
##
## Coefficients:
## (Intercept) geneetpD mediumglycerol
## 297.3 1032.0 535.5
## geneetpD:mediumglycerol
## 3388.1
The model coefficients can be interpreted as follows.
(Intercept)
can be taken to represent a “baseline” fluorescence. In this case, it is the estimated (mean) fluorescence for the gene etpC in MOPS, supplemented with glucose, and has value 297.3 units. All other coefficients represent changes to this baseline.geneetpD
is the estimated effect of changing the gene to etpD (but remaining in the glucose-supplemented medium). The change of gene (in the same medium) is estimated to increase measured fluorescence by 1032 units.mediumglycerol
is the estimated (mean) effect of changing the supplement for the medium from glucose to glycerol, for etpC. This has the value 535.5 units.geneetpD:mediumglycerol
is the estimated (mean) effect of changing both the supplement medium and the gene at the same time. This has the value 3388.1, which is much larger than the estimated additive effect of changing gene or supplement alone (1032 + 535.5 = 1567.5). This can be interpreted to indicate that there is an enhancing effect of the glycerol supplement on etpD-associated fluorescence, greater than would be expected from a linear combination of the individual effects of changing gene or medium supplement.