This is a tutorial, following through Chris Fonnesbeck's primer on using PyStan with Bayesian Multilevel Modelling.
%pylab inline
import numpy as np
import pandas as pd
import pystan
import seaborn as sns
import clean_data
sns.set_context('notebook')
We can also extend partial pooling to a linear model of the relationship between measured log(radon), the prevailing state radon level, and the effect of the floor at which the measurement was made, where the county affects the effect of making the measurement at different floors. In this linear model, the measured radon level in a household $y_i$ is a function of the floor at which measurement took place, $x_i$, with parameters $\alpha$ (the prevailing radon level in the state) and $\beta_{j[i]}$ (the influence of the floor, which varies by county), and residual error $\epsilon_i$.
$$y_i = \alpha + \beta_{j[i]} x_i + \epsilon_i$$
In this linear model, the floor effect $\beta_{j[i]}$ is the slope, with random Normal effect:
$$\beta_{j[i]} \sim N(\mu_{\beta}, \sigma_{\beta}^2)$$
The residual error is also sampled from a Normal distribution:
$$\epsilon_i \sim N(0, \sigma_y^2)$$
As in the previous example, varying intercept, the multilevel modelling approach allows parameter distributions - information to be shared across groups, which can lead to more reasonable estimates of parameters with relatively little data.
We define the model in stan
, as usual specifying data
, parameters
, transformed parameters
and model
blocks. The model is built up as follows.
Our observed log(radon) measurements ($y$) approximate an intermediate transformed parameter $\hat{y}$, which is normally distributed with variance $\sigma_y^2$. $\sigma_y$ is sampled from a Uniform distribution.
$$y \sim N(\hat{y}, \sigma_y^2)$$ $$\sigma_{y} \sim U(0, 100)$$
The transformed variable $\hat{y}$ is a linear function of $x_i$, the floor at which radon is measured. The parameters are the value of $\alpha$ across the whole state, and the effect due to which floor is used for measurement, and this is associated with the county $i$ ($i=1,\ldots,N$) in which each household is found.
$$\hat{y_i} = {\alpha + \beta_{j[i]} x_i}$$
The value of $\beta$ for each county $j[i]$, is Normally distributed with mean $\mu_{\beta}$ and variance $\sigma_{\beta}^2$. $\sigma_{\beta}$ is sampled from a Uniform distribution, between 0 and 100. $\mu_{\beta}$ is an unconstrained real value. There is a common mean and variance underlying each of the floor effects for each county.
$$\beta_{j[i]} \sim N(\mu_{\beta}, \sigma_{\beta}^2)$$ $$\sigma_{\beta} \sim U(0, 100)$$
The value of $\alpha$ is assumed to be Normally distributed about zero, with unit variance:
$$\alpha \sim N(0, 1)$$
In data
:
J
is the number of counties (int
)N
is the number of samples (int
)county
is a list of N
values from 1-85, specifying the county index each measurementx
is a vector
of indices for which floor the radon measurements were taken at each householdy
is a vector
of log(radon) measurements, one per household/sample.We define parameters:
a
(real
) representing $\alpha$, the prevailing state-wide radon levelb
(vector
, one value per county), representing $\beta$, the vector of floor effects for each county.mu_b
, a real
corresponding to $\mu_{\beta}$, the mean floor effect underlying the distribution from which the county levels are drawn.sigma_b
is $\sigma_{\beta}$, the standard deviation of the floor effect distribution underlying the county level effects: variability of county means about the average.sigma_y
is $\sigma_y$, the standard deviation of the measurement/sampling error: residual error of the observations.varying_slope = """
data {
int<lower=0> J;
int<lower=0> N;
int<lower=1,upper=J> county[N];
vector[N] x;
vector[N] y;
}
parameters{
real a;
vector[J] b;
real mu_b;
real<lower=0,upper=100> sigma_b;
real<lower=0,upper=100> sigma_y;
}
transformed parameters {
vector[N] y_hat;
for(i in 1:N)
y_hat[i] <- a + x[i] * b[county[i]];
}
model {
sigma_b ~ uniform(0, 100);
b ~ normal(mu_b, sigma_b);
a ~ normal(0, 1);
sigma_y ~ uniform(0, 100);
y ~ normal(y_hat, sigma_y);
}
"""
Again, we map Python variables to those in the model, and run the fit:
n_county = clean_data.srrs_mn.groupby('county')['idnum'].count() # number of samples from each county
varying_slope_data = {'N': len(clean_data.log_radon),
'J': len(n_county),
'county': clean_data.county + 1,
'x': clean_data.floor_measure,
'y': clean_data.log_radon}
varying_slope_fit = pystan.stan(model_code=varying_slope,
data=varying_slope_data,
iter=1000, chains=2)
We can then collect the county-level estimates of floor effect, the slope of the model, $\beta_{j[i]}$, from b
(1000 iterations x 85 counties):
b_sample = pd.DataFrame(varying_slope_fit['b'])
We can visualise the distribution of these estimates, by county, with a boxplot:
# Boxplot of estimated floor effect by county
plt.figure(figsize=(16, 6))
g = sns.boxplot(data=b_sample, whis=np.inf, color="c")
g.set_xticklabels(clean_data.mn_counties, rotation=90) # label counties
g.set_title("Estimate of floor effect, by county")
g;
From this plot we can see that the estimate for $\beta$ is approximately -0.6 in almost all counties, and there is very little variation in the median estimate, though the range of uncertainty is, as before, greatest in the counties with fewest households, and least in the counties with the most households.
We can plot the linear relationships by county, to see how the fitted model estimates of slope vary:
xvals = np.arange(2) # floor
bp = varying_slope_fit['a'].mean() # mean a (intercept) across state
mp = varying_slope_fit['b'].mean(axis=0) # mean b (slope/floor effect) by county
for mi in mp:
plt.plot(xvals, mi * xvals + bp, 'bo-', alpha=0.3)
plt.xticks([0, 1])
plt.xlim(-0.1, 1.1)
plt.ylim(0, 1.5)
plt.title("fitted relationships by county")
plt.xlabel("floor")
plt.ylabel("log(radon)");
It is clear from this plot that we have fitted the same prevailing radon level to each county, but with a different floor effect in each county. There are four or five counties with smaller floor effects, but the bulk of counties form a set of similar fits.