This tutorial works through Chris Fonnesbeck's primer on using PyStan for Bayesian multilevel modelling.
%pylab inline
import numpy as np
import pandas as pd
We import the radon data, which can be found in the file data/srrs2.dat. For cleanup, we strip whitespace from the column headers, restrict the data to the state of Minnesota (coded as MN), and add a unique numerical identifier (fips) for each county, derived from the state identifier stfips and the county identifier cntyfips.
# Import radon data
srrs2 = pd.read_csv('data/srrs2.dat')
srrs2.columns = srrs2.columns.map(str.strip)
# Make a combined state and county ID, by household
srrs_mn = srrs2.assign(fips=srrs2.stfips * 1000 + srrs2.cntyfips)[srrs2.state == 'MN']
We inspect the first few lines of the data with the .head() method, and examine which columns are available:
# Check first few lines
srrs_mn.head()
# What columns are available?
srrs_mn.columns
We also import uranium data (found in the file data/cty.dat) for each county in the state. We create the same numerical identifier for county (fips) as before, from stfips and ctfips.
# Obtain the uranium level as a county-level predictor
cty = pd.read_csv('data/cty.dat')
cty_mn = cty[cty.st == 'MN'].copy() # MN only data
# Make a combined state and county id, by county
cty_mn['fips'] = 1000 * cty_mn.stfips + cty_mn.ctfips
We check the first few lines (the uranium concentration is in the column Uppm), and what columns are available:
# Check first few lines of data
cty_mn.head()
# What columns are in the data?
cty_mn.columns
It is convenient to bring all the data into a single dataframe, with radon and uranium data by household, so we merge the two tables on the unique county identifier, assigning uranium data to all households in a county. We check that the column Uppm has been merged.
# Combine data into a single dataframe
srrs_mn = srrs_mn.merge(cty_mn[['fips', 'Uppm']], on='fips') # Get uranium level by household (on county basis)
srrs_mn = srrs_mn.drop_duplicates(subset='idnum') # Lose duplicate houses
u = np.log(srrs_mn.Uppm) # log-transform uranium level
n = len(srrs_mn) # number of households
# Check first few lines of data
srrs_mn.head()
We create a dictionary associating each county with a unique index code, which will be used to build variables to be passed to Stan.
# Index counties with a lookup dictionary
srrs_mn['county'] = srrs_mn.county.str.strip()  # remove stray whitespace from county names
mn_counties = srrs_mn.county.unique()           # unique county names
counties = len(mn_counties)                     # number of counties
county_lookup = dict(zip(mn_counties, range(len(mn_counties))))
For construction of a Stan model, it is convenient to have the relevant variables as local copies; this aids readability.

- county: index code for each county
- radon: radon activity
- log_radon: log radon activity
- floor_measure: floor on which the measurement was taken

# Make local copies of variables
county = srrs_mn['county_code'] = srrs_mn.county.replace(county_lookup).values  # integer county index per household
radon = srrs_mn.activity
srrs_mn['log_radon'] = log_radon = np.log(radon + 0.1).values  # 0.1 offset avoids log(0) for zero readings
floor_measure = srrs_mn.floor.values
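As a brief, hypothetical sketch (not part of the original notebooks) of how these local variables might later be passed to Stan, the data are typically collected into a plain Python dictionary whose keys match the model's data block. The key names below (N, J, county, x, y) are assumptions for illustration and would need to match whichever model is actually fitted:

# Hypothetical data dictionary for a Stan model; the key names are
# illustrative and must match the model's data block.
radon_data = {
    'N': n,                   # number of households
    'J': counties,            # number of counties
    'county': county + 1,     # Stan indexes arrays from 1, not 0
    'x': floor_measure,       # floor of measurement (0 = basement)
    'y': log_radon,           # log radon activity
}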
To support the following notebooks, the data import, clean-up and variable creation code above is made available in the Python module clean_data.py. This will be imported at the top of each notebook as:
import clean_data
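Assuming clean_data.py exposes the same names defined above (n, counties, county, log_radon, floor_measure, and so on; this is an assumption about its contents, which are not shown here), the prepared variables are then available as module attributes, for example:

import clean_data

print(clean_data.n)              # number of households
print(clean_data.counties)       # number of Minnesota counties
print(clean_data.log_radon[:5])  # log radon activity for the first five households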