This tutorial works through Chris Fonnesbeck's primer on using PyStan for Bayesian multilevel modelling.
%pylab inline
import numpy as np
import pandas as pd
We import the radon data, which can be found in the file data/srrs2.dat. For cleanup, we strip whitespace from the column headers, restrict the data to the state of Minnesota (coded as MN), and add a unique numerical identifier (fips) for each county, derived from the state identifier stfips and the county identifier cntyfips.
# Import radon data
srrs2 = pd.read_csv('data/srrs2.dat')
srrs2.columns = srrs2.columns.map(str.strip)
# Make a combined state and county ID, by household
srrs_mn = srrs2.assign(fips=srrs2.stfips * 1000 + srrs2.cntyfips)[srrs2.state == 'MN']
We inspect the first few lines of the data with the .head() method, and examine which columns are available:
# Check first few lines
srrs_mn.head()
# What columns are available?
srrs_mn.columns
We also import uranium data (found in the file data/cty.dat) for each county in the state. We create the same numerical identifier for county (fips) as before, from stfips and ctfips.
# Obtain the uranium level as a county-level predictor
cty = pd.read_csv('data/cty.dat')
cty_mn = cty[cty.st == 'MN'].copy() # MN only data
# Make a combined state and county id, by county
cty_mn['fips'] = 1000 * cty_mn.stfips + cty_mn.ctfips
We check the first few lines (the uranium concentration is in the column Uppm), and what columns are available:
# Check first few lines of data
cty_mn.head()
# What columns are in the data?
cty_mn.columns
It is convenient to bring all the data into a single dataframe, with radon and uranium data by household, so we merge the two tables on the unique county identifier, assigning uranium data to all households in a county. We check that the column Uppm has been merged.
# Combine data into a single dataframe
srrs_mn = srrs_mn.merge(cty_mn[['fips', 'Uppm']], on='fips') # Get uranium level by household (on county basis)
srrs_mn = srrs_mn.drop_duplicates(subset='idnum') # Lose duplicate houses
u = np.log(srrs_mn.Uppm) # log-transform uranium level
n = len(srrs_mn) # number of households
# Check first few lines of data
srrs_mn.head()
We create a dictionary associating each county with a unique index code, which will be used to build variables to be passed to Stan.
# Index counties with a lookup dictionary
srrs_mn['county'] = srrs_mn.county.str.strip()  # remove stray whitespace from county names
mn_counties = srrs_mn.county.unique()           # unique county names
counties = len(mn_counties)                     # number of counties
county_lookup = dict(zip(mn_counties, range(len(mn_counties))))
For construction of a Stan model, it is convenient to have the relevant variables as local copies; this aids readability.

- county: index code for each county
- radon: radon activity
- log_radon: log radon activity
- floor_measure: floor on which the measurement was taken

# Make local copies of variables
county = srrs_mn['county_code'] = srrs_mn.county.replace(county_lookup).values  # integer county index per household
radon = srrs_mn.activity
srrs_mn['log_radon'] = log_radon = np.log(radon + 0.1).values  # 0.1 offset avoids log(0) for zero readings
floor_measure = srrs_mn.floor.values
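As a brief, hypothetical sketch (not part of the original notebooks) of how these local variables might later be passed to Stan, the data are typically collected into a plain Python dictionary whose keys match the model's data block. The key names below (N, J, county, x, y) are assumptions for illustration and would need to match whichever model is actually fitted:

# Hypothetical data dictionary for a Stan model; the key names are
# illustrative and must match the model's data block.
radon_data = {
    'N': n,                   # number of households
    'J': counties,            # number of counties
    'county': county + 1,     # Stan indexes arrays from 1, not 0
    'x': floor_measure,       # floor of measurement (0 = basement)
    'y': log_radon,           # log radon activity
}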
To support the following notebooks, the data import, clean-up and variable creation code above is made available in the Python module clean_data.py. This will be imported at the top of each notebook as:
import clean_data
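Assuming clean_data.py exposes the same names defined above (n, counties, county, log_radon, floor_measure, and so on; this is an assumption about its contents, which are not shown here), the prepared variables are then available as module attributes, for example:

import clean_data

print(clean_data.n)              # number of households
print(clean_data.counties)       # number of Minnesota counties
print(clean_data.log_radon[:5])  # log radon activity for the first five households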