05/09/2018

Etherpad

Please do take notes on the course etherpad:

http://pad.software-carpentry.org/2018-09-05-dundee

  • Communal notes: share your understanding, and benefit from others
  • Ask questions: get detailed answers with links and examples
  • A record/reference for after the course

Learning Objectives

  • Fundamentals of RStudio (refresher)
  • RStudio project creation and version control (refresher)
  • Flow control in R (refresher)
  • Functions in R (refresher)
  • Literate programming with RMarkdown and knitr
  • Good practice for programming and project management

We’re assuming some familiarity with:

  • R syntax, data types and structures (especially data.frames)
  • variables and variable assignment (<-)
  • using R packages
  • R base graphics/ggplot2

SECTION 01: RStudio

Learning Objectives

  • The elements of an RStudio session
    • interactive code
    • writing scripts/documents
    • live view of graphical output
    • getting help
    • interaction with the filesystem
    • project and environment management

What is RStudio?

  • RStudio is an integrated development environment (IDE) for R, available on Windows, macOS and Linux
  • Interaction with R (console/‘scratchpad’)
  • Script/code editor
  • Graphics/visualisation
  • Project management (git integration)

RStudio overview - Interactive Demo

INTERACTIVE DEMO

Built-in Functions

  • Function (log(), sin() etc.) ≈ “canned script”
    • automate complicated tasks
    • make code more readable and reusable
  • Some functions are built-in (in base packages, e.g. sqrt(), lm(), plot())
  • Some functions are imported from libraries
  • Functions usually take arguments (input)
  • Functions often return values (output)
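
As a small illustration of arguments (input) and return values (output), the sketch below calls three base R functions. The variable names are chosen here for illustration only.

```r
# sqrt() takes one argument and returns its square root
root <- sqrt(16)
print(root)          # 4

# log() accepts an optional second argument, `base`
log_ten <- log(100, base = 10)
print(log_ten)       # 2

# The returned value can be stored in a variable and reused later
rounded_pi <- round(pi, digits = 2)
print(rounded_pi)    # 3.14
```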

INTERACTIVE DEMO

Getting Help For Built-in Functions

INTERACTIVE DEMO

?fname                 # help page for fname
help(fname)            # help page for fname
??fname                # any mention of fname
args(fname)            # arguments for fname
vignette(fname)        # worked examples for fname
vignette()             # show all available vignettes
help.search("text")    # any mention of "text"

Numerical comparisons

  • Computers store numbers with limited precision, so exact comparisons can surprise you
    • (they do what you tell them, not necessarily what you want)

INTERACTIVE DEMO

> pi - 1e-8 == pi
[1] FALSE
> all.equal(pi - 1e-8, pi)
[1] TRUE
> log(0.01 ^ 200)
[1] -Inf
> 200 * log(0.01)
[1] -921.034
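
Another sketch of the same issue, using values chosen here for illustration: an exact comparison with == fails for numbers that look equal, while all.equal() compares within a small tolerance. Wrapping it in isTRUE() gives a plain TRUE/FALSE that is safe to use in an if() condition.

```r
a <- 0.1 + 0.2
b <- 0.3

exact_match <- a == b                    # exact comparison of two doubles
close_match <- isTRUE(all.equal(a, b))   # comparison within a small tolerance

print(exact_match)   # FALSE
print(close_match)   # TRUE
```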

Working in RStudio

We can write code in several ways in RStudio

  • At the console (you’ve done this)
  • In a script
  • As an interactive notebook or markdown file
  • As a Shiny app

We’re going to create a new dataset and R script.

  • Putting code in a script makes it easier to modify, share and run

INTERACTIVE DEMO

SECTION 02: My First RStudio Project

Learning Objectives

  • Good practice for RStudio project structure
  • Load data into an RStudio project
  • Produce summary statistics of data
  • Extract subsets of data
  • Plotting data in R

Good Project Management Practices

No single ‘right way’ - only good principles

  • Use a single working directory per project/analysis
    • easier to move, share, and find files
    • use relative paths to locate files
  • Treat raw data as read-only
    • keep in separate subfolder (data?)
  • Clean data ready for work/analysis
    • keep cleaned/modified data in separate folder (clean_data?)
  • Consider output generated by analysis to be disposable
    • can be regenerated by running analysis/code
    • don’t place under version control
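
One possible way to set up such a layout from R is sketched below. It works in a temporary directory so that nothing is left behind; the project name my_project and the results folder are assumptions made for this example, not names the course prescribes.

```r
# Work inside a temporary directory so the sketch leaves no files behind
project_dir <- file.path(tempdir(), "my_project")

# One subfolder each for raw data, cleaned data and disposable output
for (subdir in c("data", "clean_data", "results")) {
  dir.create(file.path(project_dir, subdir),
             recursive = TRUE, showWarnings = FALSE)
}

# A relative path: valid on any machine once the project is the working directory
raw_file <- file.path("data", "gapminder-FiveYearData.csv")
print(raw_file)
print(list.files(project_dir))
```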

Example Directory Structure

Project Management in RStudio

  • RStudio tries to help you manage your projects
    • R Project concept - files and subdirectory structure
    • integration with version control
    • switching between multiple projects within RStudio
    • stores project history

Let’s create a project in RStudio

INTERACTIVE DEMO

Obtaining Data

Investigating gapminder

INTERACTIVE DEMO

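# stringsAsFactors=TRUE keeps text columns as factors (not the default from R 4.0 on), so levels() works later
gapminder <- read.table("data/gapminder-FiveYearData.csv", sep=",", header=TRUE, stringsAsFactors=TRUE)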
str(gapminder)              # structure of the data.frame
typeof(gapminder$year)      # data type of a column
length(gapminder)           # number of columns (a data.frame is a list of columns)
nrow(gapminder)             # number of rows in data.frame
ncol(gapminder)             # number of columns in data.frame
dim(gapminder)              # number of rows and columns in data.frame
colnames(gapminder)         # column names from data.frame
head(gapminder)             # first few rows of dataframe
summary(gapminder)          # summary of data in data.frame columns

SECTION 03: Program Flow Control

Learning Objectives

  • How to make data-dependent choices in R
  • Use if() and else
  • Repeat operations in R
  • Use for() loops
  • vectorisation to avoid repeating operations

if() … else

  • We often want to perform operations (or not) conditional on a piece of data
    • The if() and else construct is useful for this
# if
if (condition is true) {
  PERFORM ACTION
}

# if ... else
if (condition is true) {
  PERFORM ACTION
} else {  # i.e. if the condition is false,
  PERFORM ALTERNATIVE ACTION
}
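
A concrete instance of the template, as a minimal sketch. The value of x and the messages are arbitrary choices for illustration.

```r
x <- 8   # example value, chosen for illustration

if (x > 5) {
  outcome <- "x is greater than 5"
} else {  # i.e. x is 5 or less
  outcome <- "x is 5 or less"
}

print(outcome)   # "x is greater than 5"
```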

INTERACTIVE DEMO

Challenge (5min)

In the console, can you use an if() statement to report whether there are any records from 2002 in the gapminder dataset?

Can you do the same for 2012?

HINT: Look at the help for the any() function

for() loops

  • for() loops are a very common construction in programming
    • for each <item> in a group, <do something (with the item)>
  • Needed less often in R than in some other languages, because many operations are vectorised
for(iterator in set of values){
  do a thing
}
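
A runnable version of the pattern, as a sketch: the iterator i takes each value of 1:5 in turn, and the loop body adds it to a running total. The variable names are illustrative.

```r
total <- 0

for (i in 1:5) {
  total <- total + i   # add the current value of the iterator
}

print(total)   # 15
```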

INTERACTIVE DEMO

while() loops

  • while() loops are useful when you need to do something while some condition is true
while(this condition is true){
  do a thing
}
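
A runnable version of the pattern, as a sketch with arbitrary numbers: keep doubling n while it is below 100, and count how many doublings that takes. Unlike a for() loop, the number of repeats is not known in advance.

```r
n <- 1
doublings <- 0

while (n < 100) {
  n <- n * 2                   # the condition is re-checked after every pass
  doublings <- doublings + 1
}

print(n)           # 128
print(doublings)   # 7
```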

INTERACTIVE DEMO

Challenge (5min)

In your script, can you use a for() loop and an if() statement to loop over each country in the gapminder data, and report TRUE when it starts with the letter M, and FALSE when it does not?

HINT: Use R’s built-in startsWith() function

HINT: Use levels() to get unique country names (or unique() if the country column is character rather than factor)

Commit the script to the repository when you are done

Vectorisation

  • for() and while() loops are useful for program control (making decisions), but are not efficient for data manipulation in R
  • Many operations in R are vectorised
    • Applying a function to a vector applies it to every element of that vector
    • No need to loop

You’ve already seen and used much of this behaviour

INTERACTIVE DEMO

x <- 1:4
x * 2
y <- 6:9
x + y
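
To make the comparison concrete, the sketch below computes the same result with a loop and with a single vectorised call, then shows logical indexing on the same vector. The values are arbitrary.

```r
values <- c(1, 4, 9, 16)

# Loop version: fill a pre-allocated result one element at a time
roots_loop <- numeric(length(values))
for (i in seq_along(values)) {
  roots_loop[i] <- sqrt(values[i])
}

# Vectorised version: one call handles every element
roots_vec <- sqrt(values)

print(identical(roots_loop, roots_vec))   # TRUE

# Logical indexing: a vectorised comparison selects the matching elements
large_values <- values[values > 5]
print(large_values)                       # 9 16
```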

Challenge (5min)

In your script, can you use vectorisation to identify all countries in the gapminder data that start with the letter M?

HINT: Use levels() to get unique country names (or unique() if the country column is character rather than factor)

HINT: Use R’s built-in startsWith() function

HINT: Use logical indexing

SECTION 04: Functions

Learning Objectives

  • Why functions are important
  • How to write a new function
  • Defining a function that takes arguments
  • Returning a value from a function
  • Set default values for function arguments

Why Functions?

  • Functions let us run a complex series of commands in one go
    • under a memorable/descriptive name
    • invoked with that name
    • with a defined set of inputs and outputs
    • to perform a logically coherent task

Functions are the building blocks of programming

Small functions with one obvious, clearly-defined task are good practice

Defining a Function

  • You will often need to write your own functions
  • They take a standard form
<function_name> <- function(<arg1>, <arg2>) {
  <do something>
  return(<result>)
}

INTERACTIVE DEMO

my_sum <- function(a, b) {
  the_sum <- a + b
  return(the_sum)
}
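
A quick check of my_sum(), as a sketch. The function is repeated so the block runs on its own, and the input values are arbitrary. Because + is vectorised, the same function also works on vectors.

```r
my_sum <- function(a, b) {
  the_sum <- a + b
  return(the_sum)
}

result <- my_sum(2, 3)
print(result)             # 5

vector_result <- my_sum(1:3, 10)
print(vector_result)      # 11 12 13
```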

Documenting Functions

  • So far, you’ve been able to use R’s built-in help to see function documentation
    • This isn’t available for your own functions unless you write it

Your future self will thank you!

(and so will your colleagues)

Write programs for people, not for computers

  • State what the code does (and why)
  • Define inputs and outputs
  • Give an example
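
One possible way to apply these three points with plain comments is sketched below. The function fahr_to_kelvin() is a hypothetical example introduced here, not one defined elsewhere in these slides.

```r
# Convert a temperature from degrees Fahrenheit to Kelvin.
# Useful when source data use Fahrenheit but the analysis needs SI units.
#
# Input:   temp_f  numeric vector of temperatures in degrees Fahrenheit
# Output:  numeric vector of the same length, in Kelvin
# Example: fahr_to_kelvin(32) returns 273.15
fahr_to_kelvin <- function(temp_f) {
  kelvin <- (temp_f - 32) * (5 / 9) + 273.15
  return(kelvin)
}

freezing <- fahr_to_kelvin(32)
print(freezing)   # 273.15
```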

INTERACTIVE DEMO

Function Arguments

  • We can define functions that take multiple arguments
  • We can also define default values for arguments

INTERACTIVE DEMO

# Report countries in gapminder data
list_countries <- function(data, letter=NULL) {
  countries <- levels(data$country)
  if (!is.null(letter)) {
    matches <- startsWith(countries, letter)
    countries <- countries[matches]
  }
  return(countries)
}
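
The sketch below shows the default argument in action. It repeats list_countries() so the block runs on its own, and uses toy_data, a tiny made-up stand-in for the gapminder data that holds only a country factor.

```r
# Tiny stand-in for gapminder: only the country column, stored as a factor
toy_data <- data.frame(country = factor(c("Mali", "Mexico", "Nepal", "Mali")))

list_countries <- function(data, letter = NULL) {
  countries <- levels(data$country)
  if (!is.null(letter)) {
    matches <- startsWith(countries, letter)
    countries <- countries[matches]
  }
  return(countries)
}

all_countries <- list_countries(toy_data)        # letter keeps its default, NULL
m_countries <- list_countries(toy_data, "M")     # letter supplied by the caller

print(all_countries)   # "Mali" "Mexico" "Nepal"
print(m_countries)     # "Mali" "Mexico"
```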

SECTION 05: Dynamic Reports

Learning Objectives

  • Create dynamic, reproducible reports
  • RMarkdown syntax
  • Inline R code in documents
  • Producing documents in .pdf, .html, etc.

Literate Programming

  • A programming paradigm introduced by Donald Knuth
  • The program (or analysis) is explained in natural language
    • The source code is interspersed through the document
  • The whole document is executable

We can produce these documents in RStudio

Create an R Markdown File

  • R Markdown files embody Literate Programming in R
  • File → New File → R Markdown
  • Enter a title
  • Save the file (gets the extension .Rmd)

Components of an R Markdown File

  • Header information is fenced by ---
---
title: "Literate Programming"
author: "Leighton Pritchard"
---
  • Natural language is written as plain text
This is an R Markdown document. Markdown is a simple formatting syntax
  • R code (which is executable) is fenced by triple backticks, with {r} immediately after the opening fence

Click on Knit

Creating a Report

SECTION 06: dplyr

Learning Objectives

  • How to manipulate data.frames with five dplyr verbs and the pipe
    • a ‘grammar of data manipulation’
  • select()
  • filter()
  • group_by()
  • summarize()
  • mutate()
  • %>% (pipe)

What and Why is dplyr?

  • dplyr is a package in the Tidyverse
  • Facilitates analysis by groups
    • Helps avoid repetition
> mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
[1] 2193.755
> mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
[1] 7136.11
> mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
[1] 7902.15

Avoiding repetition (through automation) makes code

  • robust
  • reproducible

Split-Apply-Combine
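
The idea can be sketched in base R before turning to dplyr. The example uses R's built-in iris dataset as a stand-in for gapminder, so it runs without any files: split the data into groups, apply a function to each group, then combine the results.

```r
# Split: one vector of sepal lengths per species
by_species <- split(iris$Sepal.Length, iris$Species)

# Apply: compute the mean of each piece
# Combine: sapply() gathers the results into one named vector
species_means <- sapply(by_species, mean)

print(species_means)
# setosa: 5.006, versicolor: 5.936, virginica: 6.588
```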

select()

library(dplyr)  # provides select(), filter(), %>% and the other verbs used below
select(gapminder, year, country, gdpPercap)
gapminder %>% select(year, country, gdpPercap)

INTERACTIVE DEMO

filter()

  • filter() selects rows on the basis of a specified condition

INTERACTIVE DEMO

filter(gapminder, continent=="Europe")

# Select gdpPercap by country and year, only for Europe
eurodata <- gapminder %>%
              filter(continent == "Europe") %>%
              select(year, country, gdpPercap)
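
To show what the pipe does, the sketch below writes the same filter-then-select operation twice: once as nested calls and once as a pipeline. It uses the built-in iris dataset as a stand-in so it runs without the gapminder file, and assumes the dplyr package is installed.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- select(filter(iris, Species == "setosa"), Species, Sepal.Length)

# Piped: read from top to bottom, each step feeding the next
piped <- iris %>%
  filter(Species == "setosa") %>%
  select(Species, Sepal.Length)

print(isTRUE(all.equal(nested, piped)))   # TRUE
print(dim(piped))                         # 50 2
```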

Challenge (5min)

Can you write a single command (which may span several lines in your RMarkdown if you use pipes) to produce a dataframe containing:

  • year, country, life expectancy data
  • only for African nations

How many rows does the dataframe have?

group_by()

group_by(gapminder, continent)
gapminder %>% group_by(continent)

INTERACTIVE DEMO

summarize()

# Produce table of mean GDP by continent
gapminder %>%
    group_by(continent) %>%
    summarize(meangdpPercap=mean(gdpPercap))

INTERACTIVE DEMO

Challenge (5min)

  • Can you calculate the average life expectancy per country in the gapminder data?
  • Which nation has the longest life expectancy, and which the shortest?

count() and n()

  • Two useful functions related to summarize()
    • count()/tally(): a function that reports a table of counts by group
    • n(): a function used within summarize(), filter() or mutate() to represent count by group

INTERACTIVE DEMO

gapminder %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)

gapminder %>%
  group_by(continent) %>%
  summarize(se_lifeExp = sd(lifeExp)/sqrt(n()))
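
The same pattern on the built-in iris dataset, as a sketch that runs without the gapminder file (it assumes dplyr is installed). Inside summarize(), n() gives the number of rows in the current group; the column names n_obs, mean_length and se_length are chosen here for illustration.

```r
library(dplyr)

species_summary <- iris %>%
  group_by(Species) %>%
  summarize(n_obs = n(),                                   # rows per group
            mean_length = mean(Sepal.Length),
            se_length = sd(Sepal.Length) / sqrt(n()))      # standard error

print(species_summary)   # n_obs is 50 for each of the three species
```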

mutate()

  • mutate() is a function allowing creation of new variables in a chain of dplyr verbs.

INTERACTIVE DEMO

# Calculate GDP in $billion
gdp_bill <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9)

# Calculate mean/sd of GDP per capita and of total GDP by continent and year
gdp_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion=gdpPercap*pop/10^9) %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap=mean(gdpPercap),
              sd_gdpPercap=sd(gdpPercap),
              mean_gdp_billion=mean(gdp_billion),
              sd_gdp_billion=sd(gdp_billion))
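
A smaller sketch of the same chain on the built-in iris dataset (assuming dplyr is installed): mutate() adds a column, and later verbs in the chain can use it straight away. The column name petal_ratio is a hypothetical choice for this example.

```r
library(dplyr)

petal_summary <- iris %>%
  mutate(petal_ratio = Petal.Length / Petal.Width) %>%   # new column
  group_by(Species) %>%
  summarize(mean_petal_ratio = mean(petal_ratio))        # uses the new column

print(petal_summary)
```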