This is a tutorial, following through Chris Fonnesbeck's primer on using PyStan with Bayesian Multilevel Modelling.
%pylab inline
import numpy as np
import pandas as pd
import pystan
import seaborn as sns
import clean_data
sns.set_context('notebook')
Stan model¶We're going to make predictions using the contextual effects model developed in notebook 12-contextual_effects.ipynb. There are two types of prediction that we can make, using a multilevel model:
To do this, we sample from the model, choosing values for variable that specify the properties of the individual for which we want to make the prediction.
For instance, if we wanted to make a prediction for a new house, having no basement, in St Louis county, we would sample from the model as:
$$\tilde{y}_{St Louis} \sim N(\alpha_{St Louis} + \beta(x_i = 1), \sigma_y^2)$$
To make a prediction from a model, we do the following:
stl_mu ($\mu_{St Louis}$, real) to reflect$$\mu_{St Louis} = \alpha_{St Louis} + \beta_1 u_{St Louis} + \beta_2 + \beta_3 \bar{x}_{St Louis}$$
Stan specification: generated quantities. In this we define the reported value y_stl ($\tilde{y}_{St Louis}$, the predicted value for a new house with no basement in St Louis):$$\tilde{y}_{St Louis} \sim N(\mu_{St Louis}, \sigma_y^2)$$
This requires that we also detail new data:
stl: $j[i]$, int - the index for household $i$ corresponding to St Louis.u_stl: $u_{St Louis}$, a real representing the measured uranium level in St Louisxbar_stl: $\bar{x}_{St Louis}$, real, representing the mean floor measurement in St Louisy_stl: real to hold the predicted value for $\tilde{y}_{St Louis}$We also introduce normal_rng() for the first time - a Stan function to produce a realisation from a Normal variate.
Otherwise the model is identical to that in the notebook 12-contextual_effects.ipynb.
contextual_prediction = """
data {
  int<lower=0> J; 
  int<lower=0> N; 
  int<lower=1,upper=J> county[N];
  vector[N] u;
  vector[N] x;
  vector[N] x_mean;
  vector[N] y;
  
  int<lower=0,upper=J> stl;
  real u_stl;
  real xbar_stl;
} 
parameters {
  vector[J] a;
  vector[3] b;
  real mu_a;
  real<lower=0,upper=100> sigma_a;
  real<lower=0,upper=100> sigma_y;
} 
transformed parameters {
  vector[N] y_hat;
  real stl_mu;
  for (i in 1:N)
    y_hat[i] <- a[county[i]] + u[i] * b[1] + x[i] * b[2] + x_mean[i] * b[3];
    
  stl_mu <- a[stl+1] + u_stl * b[1] + b[2] + xbar_stl * b[3];
}
model {
  mu_a ~ normal(0, 1);
  a ~ normal(mu_a, sigma_a);
  b ~ normal(0, 1);
  y ~ normal(y_hat, sigma_y);
}
generated quantities {
  real y_stl;
  
  y_stl <- normal_rng(stl_mu, sigma_y);
}
"""
We map Python variables to those in the model, and run the model ust as if we were obtaining a fitted model:
# number of samples from each county
n_county = clean_data.srrs_mn.groupby('county')['idnum'].count()  
# index value for St Louis
idx = clean_data.county_lookup['ST LOUIS']
# Create new variable for mean of floor across counties
xbar = clean_data.srrs_mn.groupby('county')['floor'].mean().rename(clean_data.county_lookup).values
x_mean = xbar[clean_data.county]  # by household
contextual_prediction_data = {'N': len(clean_data.log_radon),
                              'J': len(n_county),
                              'county': clean_data.county+1, # Stan counts starting at 1
                              'u': clean_data.u,
                              'x_mean': x_mean,
                              'x': clean_data.floor_measure,
                              'y': clean_data.log_radon,
                              'stl': idx,
                              'u_stl': np.log(clean_data.cty_mn[clean_data.cty_mn.cty=='STLOUIS'].Uppm.values)[0],
                              'xbar_stl': xbar[idx]}
contextual_prediction_fit = pystan.stan(model_code=contextual_prediction,
                                        data=contextual_prediction_data, 
                                        iter=1000, chains=2)
We're interested here in the value of y_stl ($\tilde{y}_{St Louis}$), which can be drawn from the model fit sample just as if it were any other parameter.
We can plot the distribution of this simulated value:
y_stl_mean = contextual_prediction_fit['y_stl'].mean()
contextual_prediction_fit.plot(pars=['y_stl',])
The mean value sampled from this fit is ≈0.3, so we should expect the measured radon level in a new house with no basement in St Louis to be ≈exp(0.3) ≈ 1.35, though we can see that the range of predicted values is rather wide.