Bayesian Multilevel Modelling using PyStan

This tutorial follows Chris Fonnesbeck's primer on using PyStan for Bayesian multilevel modelling.

2. Data Import and Cleaning

In [1]:
%pylab inline

import numpy as np
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

Load radon data

We import the radon data, which can be found in the file data/srrs2.dat.

For cleanup, we strip whitespace from column headers, restrict data to the state of Minnesota (coded as MN) and add a unique numerical identifier (fips) for each county, derived from the state identifier stfips and the county identifier cntyfips.

In [2]:
# Import radon data
srrs2 = pd.read_csv('data/srrs2.dat')
srrs2.columns = srrs2.columns.map(str.strip)

# Make a combined state and county ID, by household
srrs_mn = srrs2.assign(fips=srrs2.stfips * 1000 + srrs2.cntyfips)[srrs2.state == 'MN']
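The combined identifier packs the state and county codes into a single integer (state code × 1000 + county code), so Minnesota (stfips 27) and Aitkin County (cntyfips 1) become 27001, as seen in the table below. A minimal sketch of the encoding and its inverse, using toy values (the helper names here are illustrative, not part of the tutorial code):

```python
# Sketch of the FIPS packing used above (hypothetical helper names)
def make_fips(stfips, cntyfips):
    """Combine state and county codes into one unique integer ID."""
    return stfips * 1000 + cntyfips

def split_fips(fips):
    """Recover (stfips, cntyfips) from the combined ID."""
    return divmod(fips, 1000)

fips = make_fips(27, 1)  # Minnesota, Aitkin County -> 27001
```

Because county codes never exceed three digits, the packing is unambiguous and reversible.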

We inspect the first few lines of the data with the .head() method, and examine the columns that are available:

In [3]:
# Check first few lines
srrs_mn.head()
Out[3]:
idnum state state2 stfips zip region typebldg floor room basement ... startdt stopdt activity pcterr adjwt dupflag zipflag cntyfips county fips
5080 5081 MN MN 27 55735 5 1 1 3 N ... 12088 12288 2.2 9.7 1146.499190 1 0 1 AITKIN 27001
5081 5082 MN MN 27 55748 5 1 0 4 Y ... 11888 12088 2.2 14.5 471.366223 0 0 1 AITKIN 27001
5082 5083 MN MN 27 55748 5 1 0 4 Y ... 20288 21188 2.9 9.6 433.316718 0 0 1 AITKIN 27001
5083 5084 MN MN 27 56469 5 1 0 4 Y ... 122987 123187 1.0 24.3 461.623670 0 0 1 AITKIN 27001
5084 5085 MN MN 27 55011 3 1 0 4 Y ... 12888 13088 3.1 13.8 433.316718 0 0 3 ANOKA 27003

5 rows × 26 columns

In [4]:
# What columns are available?
srrs_mn.columns
Out[4]:
Index(['idnum', 'state', 'state2', 'stfips', 'zip', 'region', 'typebldg',
       'floor', 'room', 'basement', 'windoor', 'rep', 'stratum', 'wave',
       'starttm', 'stoptm', 'startdt', 'stopdt', 'activity', 'pcterr', 'adjwt',
       'dupflag', 'zipflag', 'cntyfips', 'county', 'fips'],
      dtype='object')

Load uranium data

We also import uranium data (found in the file data/cty.dat) for each county in the state.

We create the same numerical identifier for county (fips) as before, from stfips and ctfips.

In [5]:
# Obtain the uranium level as a county-level predictor
cty = pd.read_csv('data/cty.dat')
cty_mn = cty[cty.st == 'MN'].copy()  # MN only data

# Make a combined state and county id, by county
cty_mn['fips'] = 1000 * cty_mn.stfips + cty_mn.ctfips

We check the first few lines (the uranium concentration is in the column Uppm) and see what columns are available:

In [6]:
# Check first few lines of data
cty_mn.head()
Out[6]:
stfips ctfips st cty lon lat Uppm fips
1326 27 1 MN AITKIN -93.415 46.608 0.502054 27001
1327 27 3 MN ANOKA -93.246 45.273 0.428565 27003
1328 27 5 MN BECKER -95.674 46.935 0.892741 27005
1329 27 7 MN BELTRAMI -94.937 47.974 0.552472 27007
1330 27 9 MN BENTON -93.998 45.699 0.866849 27009
In [7]:
# What columns are in the data?
cty_mn.columns
Out[7]:
Index(['stfips', 'ctfips', 'st', 'cty', 'lon', 'lat', 'Uppm', 'fips'], dtype='object')

Merging datasets

It is convenient to bring all the data into a single dataframe, with radon and uranium data by household, so we merge the two tables on the unique county identifier (fips), assigning the county-level uranium measurement to every household in that county.

We check that the column Uppm has been merged.

In [8]:
# Combine data into a single dataframe
srrs_mn = srrs_mn.merge(cty_mn[['fips', 'Uppm']], on='fips')  # Get uranium level by household (on county basis)
srrs_mn = srrs_mn.drop_duplicates(subset='idnum')  # Lose duplicate houses
u = np.log(srrs_mn.Uppm)  # log-transform uranium level
n = len(srrs_mn)  # number of households
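If we wanted to confirm that every household found a matching county record, pandas' merge accepts an indicator=True argument that adds a _merge column marking each row's provenance. A sketch on toy data (the dataframes here are stand-ins, not the real files):

```python
import pandas as pd

# Toy stand-ins for the household and county tables (hypothetical data)
households = pd.DataFrame({'fips': [27001, 27001, 27003],
                           'activity': [2.2, 2.9, 3.1]})
counties_u = pd.DataFrame({'fips': [27001, 27003],
                           'Uppm': [0.502054, 0.428565]})

merged = households.merge(counties_u, on='fips', how='left', indicator=True)
# '_merge' == 'both' for every row means no household lacked uranium data
all_matched = (merged['_merge'] == 'both').all()
```

A left merge also makes unmatched households visible (they would carry NaN in Uppm) instead of silently dropping them, which the default inner join would do.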
In [9]:
# Check first few lines of data
srrs_mn.head()
Out[9]:
idnum state state2 stfips zip region typebldg floor room basement ... stopdt activity pcterr adjwt dupflag zipflag cntyfips county fips Uppm
0 5081 MN MN 27 55735 5 1 1 3 N ... 12288 2.2 9.7 1146.499190 1 0 1 AITKIN 27001 0.502054
1 5082 MN MN 27 55748 5 1 0 4 Y ... 12088 2.2 14.5 471.366223 0 0 1 AITKIN 27001 0.502054
2 5083 MN MN 27 55748 5 1 0 4 Y ... 21188 2.9 9.6 433.316718 0 0 1 AITKIN 27001 0.502054
3 5084 MN MN 27 56469 5 1 0 4 Y ... 123187 1.0 24.3 461.623670 0 0 1 AITKIN 27001 0.502054
4 5085 MN MN 27 55011 3 1 0 4 Y ... 13088 3.1 13.8 433.316718 0 0 3 ANOKA 27003 0.428565

5 rows × 27 columns

Indexing counties

We create a dictionary associating each county with a unique index code, which will be used to build variables to be passed to Stan.

In [10]:
# Index counties with a lookup dictionary
srrs_mn['county'] = srrs_mn.county.str.strip()
mn_counties = srrs_mn.county.unique()
counties = len(mn_counties)
county_lookup = dict(zip(mn_counties, range(len(mn_counties))))
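The lookup maps each county name to a consecutive integer index starting at 0, in order of first appearance; Stan receives these integer codes rather than strings. A small illustration with toy names (dict.fromkeys mirrors the order-preserving behaviour of pandas' unique()):

```python
# Toy county names, with repeats, to illustrate the lookup construction
names = ['AITKIN', 'AITKIN', 'ANOKA', 'BECKER', 'ANOKA']

unique = list(dict.fromkeys(names))          # first-appearance order, like pd.unique
lookup = dict(zip(unique, range(len(unique))))
codes = [lookup[n] for n in names]           # integer code per observation
```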

For construction of a Stan model, it is convenient to make local copies of the relevant variables; this aids readability.

  • county: index code for each county
  • radon: radon activity
  • log_radon: log radon activity
  • floor_measure: the floor on which the measurement was taken (0 = basement, 1 = first floor)
In [11]:
# Make local copies of variables
county = srrs_mn['county_code'] = srrs_mn.county.replace(county_lookup).values
radon = srrs_mn.activity
srrs_mn['log_radon'] = log_radon = np.log(radon + 0.1).values
floor_measure = srrs_mn.floor.values
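The 0.1 offset in the log transform guards against households that recorded zero radon activity, for which a plain logarithm would return negative infinity. A small sketch with toy values showing the effect:

```python
import numpy as np

# Toy activity readings, one of them zero (illustrative values only)
activity = np.array([0.0, 2.2, 2.9])

plain = np.log(activity[1:])      # fine for strictly positive values
offset = np.log(activity + 0.1)   # the offset keeps every value finite
finite = np.isfinite(offset).all()
```

The offset slightly biases the transformed values, but keeps all observations usable by the model.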

Helper script

To support the following notebooks, the data import, clean-up and variable creation code above is made available in the Python module clean_data.py. This will be imported at the top of each notebook as

import clean_data
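A hypothetical skeleton of such a module, wrapping the cells above in a single function; the file paths and variable names follow this notebook, but the real clean_data.py may be organised differently:

```python
# Hypothetical sketch of clean_data.py: the clean-up steps above in one function.
import numpy as np
import pandas as pd

def load_radon(srrs_path='data/srrs2.dat', cty_path='data/cty.dat'):
    """Return the merged Minnesota radon dataframe plus the Stan input variables."""
    # Household-level radon data, with combined state/county ID
    srrs2 = pd.read_csv(srrs_path)
    srrs2.columns = srrs2.columns.map(str.strip)
    srrs_mn = srrs2.assign(fips=srrs2.stfips * 1000 + srrs2.cntyfips)[srrs2.state == 'MN']

    # County-level uranium data, with the same combined ID
    cty = pd.read_csv(cty_path)
    cty_mn = cty[cty.st == 'MN'].copy()
    cty_mn['fips'] = 1000 * cty_mn.stfips + cty_mn.ctfips

    # Merge uranium onto households, drop duplicates, index counties
    srrs_mn = srrs_mn.merge(cty_mn[['fips', 'Uppm']], on='fips')
    srrs_mn = srrs_mn.drop_duplicates(subset='idnum')
    srrs_mn['county'] = srrs_mn.county.str.strip()
    mn_counties = srrs_mn.county.unique()
    county_lookup = dict(zip(mn_counties, range(len(mn_counties))))

    # Variables for the Stan model
    county = srrs_mn.county.replace(county_lookup).values
    log_radon = np.log(srrs_mn.activity + 0.1).values
    floor_measure = srrs_mn.floor.values
    return srrs_mn, county, log_radon, floor_measure
```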