NOTES.md - R lesson
These notes are for the tutors on the two-part R Software Carpentry course, taught 11-12th January 2018 at NUI Galway.
To clear a console environment in R:
rm(list=ls())
R: Part One
R and RStudio - how to GET AROUND R AND WHAT RStudio CAN DO
R and RStudio
SLIDE: Learning Objectives
R and RSTUDIO are.
Excel
SLIDE: What is R?
R is a PROGRAMMING LANGUAGE, and the SOFTWARE that runs programs written in that language.
R is AVAILABLE ON ALL PLATFORMS
This can sometimes be confusing - IF AT ANY POINT I AM UNCLEAR, PLEASE ASK!
Has anyone used R BEFORE? - GREEN STICKY
If you have used R, please could you be available to help one of your neighbours who has not, if they have any questions. Look around you - if there’s someone nearby with a green sticky, say ‘hi’!
Why use R?
R is FREE, and very WIDELY-USED across a range of disciplines.
SLIDE: But I already know Excel
EXCEL. Excel is brilliant at what it’s meant to do. It’s POWERFUL and INTUITIVE.
But R HAS MANY ADVANTAGES FOR REPRODUCIBLE AND COMPLEX ANALYSIS
In R, when you do anything to the original data, that original data remains unmodified (unless you overwrite the file).
In Excel, however, it’s easy to overwrite data with copy-and-paste (and many bad things have happened as a result) - see Mike Croucher’s talk for examples.
R IS A PROGRAM, every step is written down explicitly, and is transparent and understandable by someone else.
In Excel, there is no clear record of where you moved your mouse, or what you copied and pasted, and it’s not immediately obvious how your formulas work.
R code is EASY TO SHARE AND ADAPT, and to apply again to a different or a modified input dataset. It’s easy to publish the analyses via online resources, such as GitHub.
R code can also be RUN ON EXTREMELY LARGE DATASETS, quickly. That’s much harder in Excel.
SLIDE: What is RStudio?
RStudio is an INTEGRATED DEVELOPMENT ENVIRONMENT - which is to say it’s a very powerful tool for writing and using R and programs in the R language.
On the right is the Windows version, with an EXAMPLE ANALYSIS AND VISUALISATION
R DIRECTLY TO EXPERIMENT WITH DATA

## SECTION 02: Getting To Know RStudio
SLIDE: Learning Objectives
RStudio IDE
RStudio GIT LESSON because RStudio integrates naturally with git to keep all your analysis and code under version control.
R syntax, and SEE HOW R REPRESENTS DATA and how to PROGRAM IN R
R
SLIDE: RStudio overview - Interactive Demo
Start RStudio (click icon/go into start menu and select RStudio/etc.)
RSTUDIO
R console: you can type here and get instant feedback
Use R in the interactive console to get used to some of the features of the language, and RStudio. DEMO CODE: ASK PEOPLE TO TYPE ALONG
The `>` prompt means R EXPECTS INPUT
> 1 + 100
[1] 101
> 30 / 3
[1] 10
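`[1]`: this indicates the line with output in it (it is the index of the first value printed on that line)
If you type an incomplete command, R will wait for you to complete it
> 1 +
+
+
The `+` prompt is shown WHEN R EXPECTS MORE INPUT
Press `Esc` (`Ctrl-C`) to exit
> 1 +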
+ 6
[1] 7
> 1 +
+
>
R obeys the usual PRECEDENCE OPERATIONS (`(`, `**`/`^`, `/`, `*`, `+`, `-`)
> 3 + 5 * 2
[1] 13
> (3 + 5) * 2
[1] 16
> 3 + 5 * 2 ^ 2
[1] 23
> 3 + 5 * (2 ^ 2)
[1] 23
`HISTORY` TAB SHOWS ALL COMMANDS USED
R will report in SCIENTIFIC NOTATION
> 2 / 1000
[1] 0.002
> 2 / 10000
[1] 2e-04
> 5e3
[1] 5000
R has many STANDARD MATHEMATICAL FUNCTIONS
> sin(1)
[1] 0.841471
> log(1)
[1] 0
> log10(10)
[1] 1
> log(10)
[1] 2.302585
What is the difference between `log()` and `log10()`?
R has BUILT-IN HELP: type `?` then the function name
> ?log
(`sin()`)
If you’re not sure about spelling, the editor has AUTOCOMPLETION which will suggest all possible endings for something you type (try `log`) - USE TAB TO SEE AUTOCOMPLETIONS
Comparisons in R return `TRUE` or `FALSE`. DEMO CODE
`all.equal()` (machine numeric tolerance) ASK IF THERE’S ANYONE FROM MATHS/PHYSICS
> 1 == 1
[1] TRUE
> 1 != 2
[1] TRUE
> 1 < 2
[1] TRUE
> 1 <= 1
[1] TRUE
> 1 > 0
[1] TRUE
> 1 >= -9
[1] TRUE
> all.equal(1.0, 1.0)
[1] TRUE
> all.equal(1.0, 1.1)
[1] "Mean relative difference: 0.1"
> all.equal(pi-1e-7, pi)
[1] "Mean relative difference: 3.183099e-08"
> all.equal(pi-1e-8, pi)
[1] TRUE
> pi-1e-8 == pi
[1] FALSE
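A minimal sketch of why `all.equal()` is worth showing, assuming ordinary double-precision arithmetic; the variable names are illustrative only. Because `all.equal()` returns a message string rather than `FALSE` when values differ, wrapping it in `isTRUE()` gives a plain logical that is safe to use in a condition.

```r
# Decimal fractions are stored approximately, so an exact comparison can fail
a <- 0.1 + 0.2
b <- 0.3

exact_match <- (a == b)                  # compares the stored doubles exactly
close_match <- isTRUE(all.equal(a, b))   # compares within a small tolerance

print(c(exact = exact_match, close = close_match))
```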
SLIDE: Challenge 01
SLIDE: Variables
R
`Name`, and it contains the word `Samia`
`Name`, and we might ask questions like:
“What is the length of `Name`?”, meaning “What is the length of the word in the box called `Name`?” (answer: 5)
SLIDE: Variables - Interactive Demo
In R, variables are assigned with the ASSIGNMENT OPERATOR `<-`
Assign the value `1/40` to the variable `x`
> x <- 1 / 40
`x` now exists, and contains the value `0.025` - a DECIMAL APPROXIMATION of the fraction `1/40`
> x
[1] 0.025
`x` is defined, there
The Environment window in RStudio tells you the name and content of every variable currently active in your R session.
> log(x)
[1] -3.688879
> sin(x)
[1] 0.0249974
> x + x
[1] 0.05
> 2 * x
[1] 0.05
> x ^ 2
[1] 0.000625
`x` IN THE ENVIRONMENT WINDOW
> x <- 100
> x <- x + 5
> name <- "Samia"
> name
[1] "Samia"
R is not always intuitive
> length(name)
[1] 1
> nchar(name)
[1] 5
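The difference between the two answers above can be sketched as follows: `length()` counts the elements of a vector, while `nchar()` counts the characters in each element. The second vector, `names_vec`, is a made-up example and not part of the lesson data.

```r
# A single string is a character vector with one element
name <- "Samia"
n_elements <- length(name)   # how many strings are in the vector
n_letters <- nchar(name)     # how many characters are in the string

# With several strings, nchar() reports one count per element
names_vec <- c("Samia", "Jo")   # hypothetical extra example
letters_each <- nchar(names_vec)

print(n_elements)
print(n_letters)
print(letters_each)
```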
SLIDE: Functions
`sqrt()` function
`base` functions
OTHER FUNCTIONS FOR SPECIFIC TASKS CAN BE BROUGHT IN, THROUGH `libraries`
`sqrt(4)` - the `4` is an argument
`sqrt(4)` returns the value `2`
SLIDE: Getting Help for Functions
> args(lm)
function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
NULL
> ?sqrt
> help(sqrt)
> ??sqrt
> help.search("sqrt")
> help.search("categorical")
> vignette(two-table)
Error in vignette(two - table) : object 'two' not found
> vignette("two-table")
SLIDE: Removing Variables
`rm()`
`ls()` IS A FUNCTION THAT LISTS VARIABLES (like the Environment tab)
> x <- 1
> y <- 2
> z <- 3
> ls()
[1] "x" "y" "z"
> rm(x)
> ls()
[1] "y" "z"
> rm(y, z)
> ls()
character(0)
SLIDE: Challenge 02
Solution:
mass <- 47.5
This will give a value of 47.5 for the variable mass
age <- 122
This will give a value of 122 for the variable age
mass <- mass * 2.3
This will multiply the existing value of 47.5 by 2.3 to give a new value of 109.25 to the variable mass.
age <- age - 20
This will subtract 20 from the existing value of 122 to give a new value of 102 to the variable age.
SLIDE: Good Variable Names
Use a CONSISTENT NAMING STYLE
R
SLIDE: Good Project Management Practices
SLIDE: Example Directory Structure
`WORKING DIR/` is the root directory of the project.
(`git` files; configuration files; notes to yourself; whatever)
`data/` is a subdirectory for storing data
`data/raw`, `data/intermediate` - USE SUBFOLDERS WHEN SENSIBLE
`data_output/` could be a place to write the analysis output (`.csv` files etc.)
`documents/` is a place where notes, drafts, and explanatory text could be stored
`fig_output/` could be a place to write graphical output of the analysis (keep separate from tables)
`scripts` might be where you would choose to keep executable code that automates your analysis
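If a tutor wants to show the layout above being built from R rather than from the file browser, one possible sketch is below. It writes into a temporary directory so nothing in a real project is touched; the folder name `swc-r-lesson-demo` is an assumption made for illustration.

```r
# Build the example layout inside a throwaway directory
project_root <- file.path(tempdir(), "swc-r-lesson-demo")  # hypothetical name
subdirs <- c("data/raw", "data/intermediate", "data_output",
             "documents", "fig_output", "scripts")

for (d in subdirs) {
  # recursive = TRUE also creates the parent "data" directory
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root
created <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
print(created)
```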
SLIDE: Project Management in RStudio
RStudio TRIES TO BE HELPFUL and provides the ‘Project’ concept
GIT
RStudio
INTERACTIVE DEMO
`File` -> `New Project`
`GitHub` or some other repository
`New Directory`
RStudio. Here we want `New Project`
`New Project`
`swc-r-lesson`
`Create a git repository` - this will create and initialise a `git` repository, just for this project
`Create Project`
YOU SHOULD SEE AN EMPTY-ISH RSTUDIO WINDOW
`*.Rproj` - information about your project; `.gitignore` - your project’s `.gitignore` file (remember the `git` lesson?)
`GIT` TAB
`?`)
`DIFF` TO SEE CHANGES (note colours, lines, etc.)
`COMMIT` TO COMMIT CHANGES
`COMMIT` (explain message)
`scripts` and `data`
`New Folder`
`scripts`)
`Files` tab (but not in the `git` tab, as the directory is empty)
`data/`
SLIDE: Working in RStudio
RStudio offers SEVERAL WAYS TO WRITE CODE
RStudio also has an editor for writing scripts, notebooks, markdown documents, and Shiny applications (EXPLAIN BRIEFLY)
INTERACTIVE DEMO OF R SCRIPT
`File` -> `New File` -> `Text File`. NOTE THAT THE EDITOR WINDOW OPENS
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
`data/feline_data.csv`
`data/` subdirectory
`feline_data.csv`
`GIT` TAB
`data` directory appears!
`Staged` and THE FILE APPEARS
`Commit` and add a commit string, e.g. “add cat dataset”
CLOSE THE EDITOR FOR THAT FILE
`File` -> `New File` -> `R Script`.
`read.csv()`
`read.csv()` is a FUNCTION that reads data from a CSV-FORMAT FILE into a variable in R
# Script for exploring data structures
# Load cat data as a dataframe
cats <- read.csv(file = "data/feline_data.csv")
`File` -> `Save`
`scripts/` subdirectory
`data_structures` (EXTENSION IS AUTOMATICALLY APPLIED)
NOTE CHANGES IN `GIT` TAB
`scripts` directory appears!
`Staged` and the file appears
`Commit` and add a commit string, e.g. “add data structures script”
`Source` and NOTE THIS RUNS THE WHOLE SCRIPT
`Environment` tab
`cats`
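For tutors who want to check the script without the project folders in place, here is a self-contained sketch of the same steps. It writes the three-cat CSV from the demo to a temporary file (the path is an assumption, not the lesson's `data/feline_data.csv`) and reads it back with `read.csv()`.

```r
# Write the demo CSV to a temporary location
csv_path <- file.path(tempdir(), "feline_data.csv")
writeLines(c("coat,weight,likes_string",
             "calico,2.1,1",
             "black,5.0,0",
             "tabby,3.2,1"),
           csv_path)

# Load cat data as a dataframe; the first line supplies the column names
cats <- read.csv(file = csv_path)
print(cats)
```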
SLIDE: 03. A First Analysis in RStudio
SLIDE: Learning Objectives
R project/analysis
SLIDE: Our Task
We’ve been ASKED TO PRODUCE A SUMMARY AND SOME GRAPHS
`data/`
`data`, called `data`. THIS IS UNTIDY, SO LET’S CLEAN
`inflammation` TO THE PARENT FOLDER
SLIDE: Loading Data - Interactive Demo
START DEMO
`View File`
`read.csv()` to read the data in
`scripts/inflammation` (RStudio adds the extension)
`Files` and `Git` windows
# Preliminary analysis of inflammation in arthritis patients
# Load data (no headers, CSV)
data <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
`Source` the script
`Environment` window: 60 observations (patients) of 40 variables (days)
`data`
`Vn` for variable `n`
`dim()` - dimensions of data: rows X columns
`length()` - number of columns in the table
`ncol()` - number of columns in the table
`nrow()` - number of rows in the table
> head(data, n = 2)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26
1 0 0 1 3 1 2 4 7 8 3 3 3 10 5 7 4 7 7 12 18 6 13 11 11 7 7
2 0 1 2 1 2 1 3 2 2 6 10 11 5 9 4 4 7 16 8 6 18 4 12 5 12 7
V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
1 4 6 8 8 4 4 5 7 3 4 2 3 0 0
2 11 5 11 3 3 5 4 4 5 5 1 1 0 1
> dim(data)
[1] 60 40
> length(data)
[1] 40
> ncol(data)
[1] 40
> nrow(data)
[1] 60
SLIDE: Challenge 03
SOLUTION
read.csv(file='file.csv', sep=';', dec=',')
SLIDE: Indexing Data
`[row, column]` in square brackets
> ncol(data)
[1] 40
> data[1,1]
[1] 0
> data[50,1]
[1] 0
> data[50,20]
[1] 16
> data[30,20]
[1] 16
`:` separator to mean ‘to’:
> data[1:4, 1:4] # rows 1 to 4; columns 1 to 4
V1 V2 V3 V4
1 0 0 1 3
2 0 1 2 1
3 0 1 1 3
4 0 0 2 0
> data[30:32, 20:22]
V20 V21 V22
30 16 14 15
31 16 13 7
32 9 19 15
> data[5,]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26
5 0 1 1 3 3 1 3 5 2 4 4 7 6 5 3 10 8 10 6 17 9 14 9 7 13 9
V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
5 12 6 7 7 9 6 3 2 2 4 2 0 1 1
> data[,16]
[1] 4 4 15 8 10 15 13 9 11 6 3 8 12 3 5 10 11 4 11 13 15 5 14 13 4 9 13 6 7 6 14
[32] 3 15 4 15 11 7 10 15 6 5 6 15 11 15 6 11 15 14 4 10 15 11 6 13 8 4 13 12 9
SLIDE: Summary Functions - Interactive Demo
R was designed for data analysis, so has many built-in functions for analysing and describing data
> max(data)
[1] 20
> max(data[2,])
[1] 18
> max(data[,7])
[1] 6
> min(data[,7])
[1] 1
> mean(data[,7])
[1] 3.8
> median(data[,7])
[1] 4
> sd(data[,7])
[1] 1.725187
SLIDE: Challenge 04
> animal <- c('m', 'o', 'n', 'k', 'e', 'y')
> animal[1:3]
[1] "m" "o" "n"
> animal[4:6]
[1] "k" "e" "y"
> animal[3:1]
[1] "n" "o" "m"
> animal[-1]
[1] "o" "n" "k" "e" "y"
> animal[-4]
[1] "m" "o" "n" "e" "y"
> animal[-1:-4]
[1] "e" "y"
> animal[-1:4]
Error in animal[-1:4] : only 0's may be mixed with negative subscripts
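The final error is a precedence effect, which can be sketched as follows: the unary minus is applied before `:`, so `-1:4` is the sequence from -1 up to 4, which mixes negative and positive subscripts. Bracketing the range first gives the intended "drop elements 1 to 4". The names `mixed` and `negated` are illustrative.

```r
animal <- c('m', 'o', 'n', 'k', 'e', 'y')

# Unary minus binds more tightly than ':', so this runs from -1 to 4
mixed <- -1:4

# Bracket the range first, then negate every index
negated <- -(1:4)

print(mixed)
print(negated)
print(animal[negated])
```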
SLIDE: Repetitive Calculations - Interactive Demo
R
R has an `apply()` function exactly for this
> apply(X = data, MARGIN = 1, FUN = mean)
[1] 5.450 5.425 6.100 5.900 5.550 6.225 5.975 6.650 6.625 6.525 6.775 5.800 6.225 5.750 5.225
[16] 6.300 6.550 5.700 5.850 6.550 5.775 5.825 6.175 6.100 5.800 6.425 6.050 6.025 6.175 6.550
[31] 6.175 6.350 6.725 6.125 7.075 5.725 5.925 6.150 6.075 5.750 5.975 5.725 6.300 5.900 6.750
[46] 5.925 7.225 6.150 5.950 6.275 5.700 6.100 6.825 5.975 6.725 5.700 6.250 6.400 7.050 5.900
R functions
# Calculate average inflammation by patient and day
avg_inflammation_patient <- apply(X = data, MARGIN = 1, FUN = mean)
avg_inflammation_day <- apply(data, 2, mean)
R function that’s a shortcut
> rowMeans(data)
[1] 5.450 5.425 6.100 5.900 5.550 6.225 5.975 6.650 6.625 6.525 6.775 5.800 6.225 5.750 5.225
[16] 6.300 6.550 5.700 5.850 6.550 5.775 5.825 6.175 6.100 5.800 6.425 6.050 6.025 6.175 6.550
[31] 6.175 6.350 6.725 6.125 7.075 5.725 5.925 6.150 6.075 5.750 5.975 5.725 6.300 5.900 6.750
[46] 5.925 7.225 6.150 5.950 6.275 5.700 6.100 6.825 5.975 6.725 5.700 6.250 6.400 7.050 5.900
> colMeans(data)
V1 V2 V3 V4 V5 V6 V7 V8 V9
0.0000000 0.4500000 1.1166667 1.7500000 2.4333333 3.1500000 3.8000000 3.8833333 5.2333333
V10 V11 V12 V13 V14 V15 V16 V17 V18
5.5166667 5.9500000 5.9000000 8.3500000 7.7333333 8.3666667 9.5000000 9.5833333 10.6333333
V19 V20 V21 V22 V23 V24 V25 V26 V27
11.5666667 12.3500000 13.2500000 11.9666667 11.0333333 10.1666667 10.0000000 8.6666667 9.1500000
V28 V29 V30 V31 V32 V33 V34 V35 V36
7.2500000 7.3333333 6.5833333 6.0666667 5.9500000 5.1166667 3.6000000 3.3000000 3.5666667
V37 V38 V39 V40
2.4833333 1.5000000 1.1333333 0.5666667
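A small sketch of what `MARGIN` means, using a made-up 2 x 3 matrix rather than the inflammation data: `MARGIN = 1` applies the function to each row, `MARGIN = 2` to each column, and the results agree with the `rowMeans()` and `colMeans()` shortcuts.

```r
# Toy matrix (hypothetical values), filled column by column
m <- matrix(1:6, nrow = 2)

row_avg <- apply(m, 1, mean)   # one value per row
col_avg <- apply(m, 2, mean)   # one value per column

print(row_avg)
print(col_avg)

# The shortcut functions give the same answers
print(all.equal(row_avg, rowMeans(m)))
print(all.equal(col_avg, colMeans(m)))
```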
SLIDE: Base Graphics
VISUALISATION IS A KEY ROUTE TO INSIGHT
R has many graphics packages - some of which produce extremely beautiful images, or are tailored to a specific problem domain
SLIDE: Plotting - Interactive Demo
R’s `plot()` FUNCTION IS GENERAL AND WORKS FOR MANY KINDS OF DATA
# Plot data summaries
# Average inflammation by patient
plot(avg_inflammation_patient)
# Average inflammation per day
plot(avg_inflammation_day)
# Maximum inflammation per day
max_inflammation_day <- apply(data, 2, max)
plot(max_inflammation_day)
# Minimum inflammation per day
plot(apply(data, 2, min))
# Show a histogram of average patient inflammation
hist(avg_inflammation_patient)
The `hist()` FUNCTION PLOTS A HISTOGRAM OF INPUT DATA FREQUENCY/COUNT
hist(avg_inflammation_patient, breaks=c(5, 6, 7, 8))
The `seq()` function generates a sequence of numbers for us
> seq(5, 8)
[1] 5 6 7 8
> hist(avg_inflammation_patient, breaks=seq(5, 8))
> seq(5, 8, by=0.2)
[1] 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0
> hist(avg_inflammation_patient, breaks=seq(5, 8, by=0.2))
# Show a histogram of average patient inflammation
hist(avg_inflammation_patient, breaks=seq(5, 8, by=0.2))
SLIDE: Challenge 05
# Plot standard deviation by day
plot(apply(data, 2, sd))
SLIDE: 04. Data Structures in R
SLIDE: Learning Objectives
R: WHAT DATA IS
R’s data types and structures relate to the types of data that you work with, yourself.
SLIDE: Data Types and Structures in R
R is MOSTLY USED FOR DATA ANALYSIS
R is set up with key, core data types designed to help you work with your own data
R focuses on tabular data (like our cat example)
INTERACTIVE DEMO
(`cats` is available as a variable)
If you type `cats`, you get a nice tabular representation of your data
> cats
coat weight likes_string
1 calico 2.1 1
2 black 5.0 0
3 tabby 3.2 1
`coat` is text; `weight` is some real value (in kg or pounds, maybe), and `likes_string` looks like it should be `TRUE`/`FALSE`
`$` notation in the console
> cats$weight
[1] 2.1 5.0 3.2
What did R RETURN?
R is largely built so that operations on vectors are central to data analysis.
> cats$weight + 2
[1] 4.1 7.0 5.2
> cats$coat
[1] calico black tabby
Levels: black calico tabby
What did R RETURN?
R DOESN’T THINK THEY’RE ONLY WORDS - it THINKS THEY’RE NAMED CATEGORIES OF OBJECT. R is assuming that you mean to import data
(`paste()`)
> paste("My cat is", cats$coat)
[1] "My cat is calico" "My cat is black" "My cat is tabby"
> cats$weight + cats$coat
[1] NA NA NA
Warning message:
In Ops.factor(cats$weight, cats$coat) : ‘+’ not meaningful for factors
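Whether `coat` arrives as a factor depends on how the file is read. The transcripts in these notes reflect the 2018 behaviour; from R 4.0.0 onwards `read.csv()` keeps text columns as character unless asked otherwise. The sketch below makes the choice explicit so it behaves the same on any version; the inline CSV text is a cut-down, illustrative copy of the cat data.

```r
# A two-row excerpt of the cat data, supplied inline for illustration
csv_text <- "coat,weight\ncalico,2.1\nblack,5.0"

# Ask explicitly for factors, or for plain character strings
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(class(as_factor$coat))
print(class(as_text$coat))
```

If the room is running a recent R, passing `stringsAsFactors = TRUE` when loading the cat data should reproduce the factor output shown in these notes.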
R’s data types reflect the ways in which data is expected to interact
Understanding R’s DATA TYPES IS KEY
how R sees your data (you want R to see your data the same way you do)
Many problems in R come down to incompatibilities between data and data types.
SLIDE: What Data Types Do You Expect?
SLIDE: Data Types in R
R’s data types are atomic: they are FUNDAMENTAL AND EVERYTHING ELSE IS BUILT UP FROM THEM, like matter is built up from atoms
R (though one is split into two…)
`1`/`0`)
`integer` and `double` (real)
# Some variables of several data types
truth <- TRUE
lie <- FALSE
i <- 3L
d <- 3.0
c <- 3 + 0i
txt <- "TRUE"
`Run`
`Data` and `Values` in the environment
`Git` tab: commit the change
`typeof()` TO FIND THE TYPE OF A VARIABLE
> typeof(i)
[1] "integer"
> typeof(c)
[1] "complex"
> typeof(d)
[1] "double"
is.<type>()
> is.numeric(3)
[1] TRUE
> is.numeric(d)
[1] TRUE
> is.double(i)
[1] FALSE
> is.integer(d)
[1] FALSE
> is.numeric(txt)
[1] FALSE
> is.character(txt)
[1] TRUE
> is.character(truth)
[1] FALSE
> is.logical(truth)
[1] TRUE
> i == c
[1] TRUE
> i == d
[1] TRUE
> d == c
[1] TRUE
`numeric` though
> is.numeric(i)
[1] TRUE
> is.numeric(c)
[1] FALSE
SLIDE: Challenge 06
SLIDE: FIVE COMMON R DATA STRUCTURES
R
INTERACTIVE DEMO IN SCRIPT
`Run` to run in console
`c()` FUNCTION (`c()` is `combine`; use `?c`)
# Define a numeric vector (stored as double, not integer)
x <- c(10, 12, 45, 33)
R functions to find out more about this variable
> length(x)
[1] 4
> typeof(x)
[1] "double"
> str(x)
num [1:4] 10 12 45 33
The `str()` function REPORTS THE STRUCTURE OF A VARIABLE
`num` means ‘numeric’; `[1:4]` means there are four elements; the elements are listed
# Define a vector
xx <- c(1, 2, 'a')
> length(xx)
[1] 3
> typeof(xx)
[1] "character"
> str(xx)
chr [1:3] "1" "2" "a"
R - they think their data is of one type, but R thinks it makes more sense to have it as another type
SLIDE: Challenge 07
SLIDE: Coercion
If R thinks it needs to, it will COERCE DATA IMPLICITLY without telling you
`logical` can be coerced to `integer`, but `integer` cannot be implicitly coerced to `logical`
`integer` can describe all `logical` values, but not vice versa
`character`, so that’s the fallback position for R
R MIGHT CONVERT THE TYPE TO COPE
R will choose the simplest data type that can represent all items in the vector
`as.<type>()`
> as.character(x)
[1] "10" "12" "45" "33"
> as.complex(x)
[1] 10+0i 12+0i 45+0i 33+0i
> as.logical(x)
[1] TRUE TRUE TRUE TRUE
> xx
[1] "1" "2" "a"
> as.numeric(xx)
[1] 1 2 NA
Warning message:
NAs introduced by coercion
> as.logical(xx)
[1] NA NA NA
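The implicit coercion order described on this slide can be sketched by combining values of different types with `c()` and asking `typeof()` what came out. The variable names are illustrative.

```r
# Each mixed vector is promoted to the simplest type that can hold every item
logical_and_integer <- c(TRUE, 2L)     # logical is promoted to integer
integer_and_double <- c(2L, 3.5)       # integer is promoted to double
double_and_text <- c(3.5, "a")         # everything falls back to character

print(typeof(logical_and_integer))
print(typeof(integer_and_double))
print(typeof(double_and_text))
```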
The `seq()` function returns a vector
`:` operator
> seq(10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(1, 10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(35, 40, by=0.5)
[1] 35.0 35.5 36.0 36.5 37.0 37.5 38.0 38.5 39.0 39.5 40.0
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 5:8
[1] 5 6 7 8
c()
> x
[1] 10 12 45 33
> c(x, 57)
[1] 10 12 45 33 57
> x
[1] 10 12 45 33
> x <- c(x, 57)
> x
[1] 10 12 45 33 57
> x <- 0:10
> tail(x)
[1] 5 6 7 8 9 10
> head(x)
[1] 0 1 2 3 4 5
> head(x, n=2)
[1] 0 1
> x <- 1:4
> names(x)
NULL
> str(x)
int [1:4] 1 2 3 4
> names(x) <- c("a", "b", "c", "d")
> x
a b c d
1 2 3 4
> str(x)
Named int [1:4] 1 2 3 4
- attr(*, "names")= chr [1:4] "a" "b" "c" "d"
SLIDE: Factors
R WAS MADE FOR STATISTICS so has a special way of dealing with the difference
`Run` the line
# Create a factor with three elements
> f <- factor(c("no", "yes", "no"))
`"yes"` and `"no"`
`1 2 1`
`"no" -> 1` and `"yes" -> 2`
`1` and `2`, BUT THESE ARE LABELLED `"no"` and `"yes"`
> length(f)
[1] 3
> str(f)
Factor w/ 2 levels "no","yes": 1 2 1
> levels(f)
[1] "no" "yes"
> f
[1] no yes no
Levels: no yes
In the `cats` DATA THE COAT WAS STORED AS A FACTOR
The `class()` function IDENTIFIES DATA STRUCTURES
> cats$coat
[1] calico black tabby
Levels: black calico tabby
> class(cats$coat)
[1] "factor"
> str(cats$coat)
Factor w/ 3 levels "black","calico",..: 2 1 3
SLIDE: Challenge 08
> f <- factor(c("case", "control", "case", "control", "case"))
> str(f)
Factor w/ 2 levels "case","control": 1 2 1 2 1
> f <- factor(c("case", "control", "case", "control", "case"), levels=c("control", "case"))
> str(f)
Factor w/ 2 levels "control","case": 2 1 2 1 2
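To make the "integers with labels" idea concrete, a short sketch: `as.integer()` exposes the stored codes, and indexing `levels()` by those codes rebuilds the original labels. The names `codes` and `labels_back` are illustrative.

```r
f <- factor(c("no", "yes", "no"))

codes <- as.integer(f)             # the integer codes stored in the factor
labels_back <- levels(f)[codes]    # look each code up in the level labels

print(codes)
print(labels_back)
```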
SLIDE: Matrices
R
They are 2D `vector`s (so contain atomic values)
`Run` the lines when done
# Create matrix of zeroes
m1 <- matrix(0, ncol = 6, nrow = 3)
# Create matrix of numbers 1 and 2
m2 <- matrix(c(1, 2), ncol = 4, nrow = 3)
`ncol` and `nrow` define the size of the matrix
`length()` of a matrix IS THE TOTAL NUMBER OF ELEMENTS
> class(m1)
[1] "matrix"
> m1
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 0 0 0 0 0
[2,] 0 0 0 0 0 0
[3,] 0 0 0 0 0 0
> str(m1)
num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
> length(m1)
[1] 18
> m2
[,1] [,2] [,3] [,4]
[1,] 1 2 1 2
[2,] 2 1 2 1
[3,] 1 2 1 2
> m2[1, ]
[1] 1 2 1 2
> m2[2:3, 3:4]
[,1] [,2]
[1,] 2 1
[2,] 1 2
SLIDE: Challenge 09 (5min)
> m <- matrix(1:50, nrow = 5, ncol = 10)
> m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 6 11 16 21 26 31 36 41 46
[2,] 2 7 12 17 22 27 32 37 42 47
[3,] 3 8 13 18 23 28 33 38 43 48
[4,] 4 9 14 19 24 29 34 39 44 49
[5,] 5 10 15 20 25 30 35 40 45 50
> ?matrix
> m <- matrix(1:50, nrow = 5, ncol = 10, byrow = TRUE)
> m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 2 3 4 5 6 7 8 9 10
[2,] 11 12 13 14 15 16 17 18 19 20
[3,] 21 22 23 24 25 26 27 28 29 30
[4,] 31 32 33 34 35 36 37 38 39 40
[5,] 41 42 43 44 45 46 47 48 49 50
SLIDE: Lists
`list`s are like vectors, EXCEPT THEY CAN HOLD ANY DATA TYPE
CREATE NEW LIST IN SCRIPT
`Run` from script
# Create a list
l <- list(1, 'a', TRUE, matrix(0, nrow = 2, ncol = 2), f)
# Create a named list
l_named <- list(a = "SWC", b = 1:4)
[[n]]
> class(l)
[1] "list"
> class(l_named)
[1] "list"
> str(l)
List of 5
$ : num 1
$ : chr "a"
$ : logi TRUE
$ : num [1:2, 1:2] 0 0 0 0
$ : Factor w/ 2 levels "no","yes": 1 2 1
> str(l_named)
List of 2
$ a: chr "SWC"
$ b: int [1:4] 1 2 3 4
> l
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[,1] [,2]
[1,] 0 0
[2,] 0 0
[[5]]
[1] no yes no
Levels: no yes
> l[[4]][1,1]
[1] 0
$
> l_named
$a
[1] "SWC"
$b
[1] 1 2 3 4
> l_named[[1]]
[1] "SWC"
> l_named[[2]]
[1] 1 2 3 4
> l_named$a
[1] "SWC"
> l_named$b
[1] 1 2 3 4
> names(l_named)
[1] "a" "b"
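One point that often trips learners here is the difference between single and double brackets on a list. A minimal sketch, reusing the named list from the demo: `[` returns a smaller list, while `[[` and `$` return the element itself. The names `sub_list` and `element` are illustrative.

```r
l_named <- list(a = "SWC", b = 1:4)

sub_list <- l_named["b"]     # still a list, holding one element
element <- l_named[["b"]]    # the integer vector itself

print(class(sub_list))
print(class(element))
print(identical(element, l_named$b))
```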
SLIDE: Logical Indexing
(`data_structures.R`)
`Run` the lines
# Create a vector for logical indexing
v <- c(5.4, 6.2, 7.1, 4.8, 7.5)
mask <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
> v
[1] 5.4 6.2 7.1 4.8 7.5
> v[mask]
[1] 5.4 7.1 7.5
Comparisons in R RETURN VECTORS OF TRUE/FALSE VALUES
> v
[1] 5.4 6.2 7.1 4.8 7.5
> v < 7
[1] TRUE TRUE FALSE TRUE FALSE
> v[v < 7]
[1] 5.4 6.2 4.8
> v < 7
[1] TRUE TRUE FALSE TRUE FALSE
> v > 5 & v < 7
[1] TRUE TRUE FALSE FALSE FALSE
> v[v > 5 & v < 7]
[1] 5.4 6.2
> v > 5 | v < 7
[1] TRUE TRUE TRUE TRUE TRUE
> v[v > 5 | v < 7]
[1] 5.4 6.2 7.1 4.8 7.5
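As an optional extension to the demo, a logical mask can also be counted or turned into positions. This sketch reuses the same vector; `sum()` treats `TRUE` as 1, and `which()` reports the indices that are `TRUE`.

```r
v <- c(5.4, 6.2, 7.1, 4.8, 7.5)
mask <- v > 5 & v < 7

n_selected <- sum(mask)     # how many elements pass the test
positions <- which(mask)    # where those elements sit in the vector

print(n_selected)
print(positions)
```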
SLIDE: 05. Dataframes
R.
R, on a practical day-to-day basis, involves dataframes
SLIDE: Learning Objectives
R DATA STRUCTURES YOU ALREADY KNOW
SLIDE: Let’s look at a data.frame
`cats` data is a small `data.frame`
`names()` … IT’S A NAMED LIST
> class(cats)
[1] "data.frame"
> cats
coat weight likes_string
1 calico 2.1 1
2 black 5.0 0
3 tabby 3.2 1
> length(cats)
[1] 3
> cats[[1]]
[1] calico black tabby
Levels: black calico tabby
> typeof(cats)
[1] "list"
> names(cats)
[1] "coat" "weight" "likes_string"
> class(cats$coat)
[1] "factor"
> class(cats$weight)
[1] "numeric"
> class(cats$likes_string)
[1] "integer"
SLIDE: What is a data.frame
R
R a bit more data-safe
SLIDE: Creating a data.frame
`Run` when done
# Create a data frame
df <- data.frame(a=c(1,2,3), b=c('eeny', 'meeny', 'miney'),
c=c(TRUE, FALSE, TRUE))
The `summary()` function SUMMARISES PROPERTIES OF EACH COLUMN
> str(df)
'data.frame': 3 obs. of 3 variables:
$ a: num 1 2 3
$ b: Factor w/ 3 levels "eeny","meeny",..: 1 2 3
$ c: logi TRUE FALSE TRUE
> df$c
[1] TRUE FALSE TRUE
> length(df)
[1] 3
> dim(df)
[1] 3 3
> summary(df)
a b c
Min. :1.0 eeny :1 Mode :logical
1st Qu.:1.5 meeny:1 FALSE:1
Median :2.0 miney:1 TRUE :2
Mean :2.0
3rd Qu.:2.5
Max. :3.0
SLIDE: Challenge 10
author_book <- data.frame(author_first = c('Charles', 'Ernst', "Theodosius"),
author_last = c("Darwin", "Mayr", "Dobzhansky"),
year = c(1859, 1942, 1970))
SLIDE: Challenge 11
> country_climate <- data.frame(country=c("Canada", "Panama",
+ "South Africa", "Australia"),
+ climate=c("cold", "hot",
+ "temperate", "hot/temperate"),
+ temperature=c(10, 30, 18, "15"),
+ northern_hemisphere=c(TRUE, TRUE,
+ FALSE, "FALSE"),
+ has_kangaroo=c(FALSE, FALSE,
+ FALSE, 1))
> str(country_climate)
'data.frame': 4 obs. of 5 variables:
$ country : Factor w/ 4 levels "Australia","Canada",..: 2 3 4 1
$ climate : Factor w/ 4 levels "cold","hot","hot/temperate",..: 1 2 4 3
$ temperature : Factor w/ 4 levels "10","15","18",..: 1 4 3 2
$ northern_hemisphere: Factor w/ 2 levels "FALSE","TRUE": 2 2 1 1
$ has_kangaroo : num 0 0 0 1
SLIDE: Challenge 12
> df <- data.frame(a=c(1,2,3), b=c('eeny', 'meeny', 'miney'),
+ c=c(TRUE, FALSE, TRUE),
+ stringsAsFactors = FALSE)
> str(df)
'data.frame': 3 obs. of 3 variables:
$ a: num 1 2 3
$ b: chr "eeny" "meeny" "miney"
$ c: logi TRUE FALSE TRUE
SLIDE: Adding rows and columns
> df
a b c
1 1 eeny TRUE
2 2 meeny FALSE
3 3 miney TRUE
> df <- cbind(df, vals = 3:1)
> df
a b c vals
1 1 eeny TRUE 3
2 2 meeny FALSE 2
3 3 miney TRUE 1
> df <- rbind(df, list(4, 'mo', FALSE, 0))
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "mo") :
invalid factor level, NA generated
> levels(df$b) <- c('eeny', 'meeny', 'miney', 'mo')
> df <- rbind(df, list(4, 'mo', FALSE, 0))
> df
a b c vals
1 1 eeny TRUE 3
2 2 meeny FALSE 2
3 3 miney TRUE 1
4 4 <NA> FALSE 0
5 4 mo FALSE 0
`-` syntax
`NA` values
`df` with one of these
> df[-4,]
a b c vals
1 1 eeny TRUE 3
2 2 meeny FALSE 2
3 3 miney TRUE 1
5 4 mo FALSE 0
> na.omit(df)
a b c vals
1 1 eeny TRUE 3
2 2 meeny FALSE 2
3 3 miney TRUE 1
5 4 mo FALSE 0
> df <- na.omit(df)
> df
a b c vals
1 1 eeny TRUE 3
2 2 meeny FALSE 2
3 3 miney TRUE 1
5 4 mo FALSE 0
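A compact sketch of the pattern shown above, with the factor created explicitly so that it behaves the same on any R version: extend the levels first, then `rbind()` the new row. The two-column data frame is a cut-down, illustrative version of `df`.

```r
# Explicit factor column, so the example does not depend on R's defaults
df <- data.frame(a = 1:3, b = factor(c("eeny", "meeny", "miney")))

# Register the new category before adding a row that uses it
levels(df$b) <- c(levels(df$b), "mo")
df <- rbind(df, list(4L, "mo"))

print(df)
print(levels(df$b))
```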
SLIDE: Writing data.frame to file
The `write.table()` function WRITES A DATAFRAME TO A FILE
`\t` means ‘tab’ - it puts a gap between columns
write.table(df, "data/df_example.tab", sep="\t")
`Files` tab
RStudio
`\t` has given tab characters (shown as gaps) as column separators
SLIDE: Reading into a data.frame
`data/`
`gapminder`
# Load gapminder data from a local CSV file
gapminder <- read.table("data/gapminder-FiveYearData.csv", sep=",", header=TRUE)
`Environment` TAB
`gapminder` in `Environment` tab.
SLIDE: Investigating gapminder
> str(gapminder)
'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
> typeof(gapminder$year)
[1] "integer"
> typeof(gapminder$country)
[1] "integer"
> str(gapminder$country)
Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
> length(gapminder)
[1] 6
> nrow(gapminder)
[1] 1704
> ncol(gapminder)
[1] 6
> dim(gapminder)
[1] 1704 6
> colnames(gapminder)
[1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
> head(gapminder)
country year pop continent lifeExp gdpPercap
1 Afghanistan 1952 8425333 Asia 28.801 779.4453
2 Afghanistan 1957 9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
4 Afghanistan 1967 11537966 Asia 34.020 836.1971
5 Afghanistan 1972 13079460 Asia 36.088 739.9811
6 Afghanistan 1977 14880372 Asia 38.438 786.1134
> summary(gapminder)
country year pop continent lifeExp
Afghanistan: 12 Min. :1952 Min. :6.001e+04 Africa :624 Min. :23.60
Albania : 12 1st Qu.:1966 1st Qu.:2.794e+06 Americas:300 1st Qu.:48.20
Algeria : 12 Median :1980 Median :7.024e+06 Asia :396 Median :60.71
Angola : 12 Mean :1980 Mean :2.960e+07 Europe :360 Mean :59.47
Argentina : 12 3rd Qu.:1993 3rd Qu.:1.959e+07 Oceania : 24 3rd Qu.:70.85
Australia : 12 Max. :2007 Max. :1.319e+09 Max. :82.60
(Other) :1632
gdpPercap
Min. : 241.2
1st Qu.: 1202.1
Median : 3531.8
Mean : 7215.3
3rd Qu.: 9325.5
Max. :113523.1
SLIDE: Subsets of data.frames
# Extract a single column, get a dataframe
> head(gapminder[3])
pop
1 8425333
2 9240934
3 10267083
4 11537966
5 13079460
6 14880372
> class(head(gapminder[3]))
[1] "data.frame"
# Extract a single named column, get a vector/factor
> head(gapminder[["lifeExp"]])
[1] 28.801 30.332 31.997 34.020 36.088 38.438
> class(head(gapminder[["lifeExp"]]))
[1] "numeric"
> head(gapminder$year)
[1] 1952 1957 1962 1967 1972 1977
> class(head(gapminder$year))
[1] "integer"
# Slice rows like a matrix, get a dataframe
> gapminder[1:3,]
country year pop continent lifeExp gdpPercap
1 Afghanistan 1952 8425333 Asia 28.801 779.4453
2 Afghanistan 1957 9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
> class(gapminder[1:3,])
[1] "data.frame"
> gapminder[3,]
country year pop continent lifeExp gdpPercap
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
> class(gapminder[3, ])
[1] "data.frame"
# Slice columns like a matrix, get vector/factor
> head(gapminder[, 3])
[1] 8425333 9240934 10267083 11537966 13079460 14880372
> class(head(gapminder[, 3]))
[1] "numeric"
# Slice columns like a matrix get dataframe
> head(gapminder[, 3, drop=FALSE])
pop
1 8425333
2 9240934
3 10267083
4 11537966
5 13079460
6 14880372
> class(head(gapminder[, 3, drop=FALSE]))
[1] "data.frame"
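The four subsetting forms above can be compared side by side on a tiny stand-in data frame (made-up values, not the gapminder data), which may be quicker to show than scrolling through the real output.

```r
# Toy stand-in with two columns
toy <- data.frame(pop = c(10, 20), year = c(1952L, 1957L))

as_df <- toy["pop"]                       # single bracket: a data.frame
as_vector <- toy[["pop"]]                 # double bracket: the column itself
matrix_style <- toy[, "pop"]              # matrix style: dropped to a vector
kept_df <- toy[, "pop", drop = FALSE]     # matrix style, kept as a data.frame

print(class(as_df))
print(class(as_vector))
print(class(matrix_style))
print(class(kept_df))
```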
SLIDE: Challenge 13
# Extract observations collected for the year 1957
> head(gapminder[gapminder$year == 1957,])
country year pop continent lifeExp gdpPercap
2 Afghanistan 1957 9240934 Asia 30.332 820.853
14 Albania 1957 1476505 Europe 59.280 1942.284
26 Algeria 1957 10270856 Africa 45.685 3013.976
38 Angola 1957 4561361 Africa 31.999 3827.940
50 Argentina 1957 19610538 Americas 64.399 6856.856
62 Australia 1957 9712569 Oceania 70.330 10949.650
# Extract all columns except 1 through 4
> head(gapminder[, -(1:4)])
lifeExp gdpPercap
1 28.801 779.4453
2 30.332 820.8530
3 31.997 853.1007
4 34.020 836.1971
5 36.088 739.9811
6 38.438 786.1134
> head(gapminder[, -1:-4])
lifeExp gdpPercap
1 28.801 779.4453
2 30.332 820.8530
3 31.997 853.1007
4 34.020 836.1971
5 36.088 739.9811
6 38.438 786.1134
# Extract all rows where life expectancy is greater than 80 years
> head(gapminder[gapminder$lifeExp > 80,])
country year pop continent lifeExp gdpPercap
71 Australia 2002 19546792 Oceania 80.370 30687.75
72 Australia 2007 20434176 Oceania 81.235 34435.37
252 Canada 2007 33390141 Americas 80.653 36319.24
540 France 2007 61083916 Europe 80.657 30470.02
671 Hong Kong China 2002 6762476 Asia 81.495 30209.02
672 Hong Kong China 2007 6980412 Asia 82.208 39724.98
# ADVANCED: Extract rows for years 2002 and 2007
> head(gapminder[gapminder$year == 2002 | gapminder$year == 2007,])
country year pop continent lifeExp gdpPercap
11 Afghanistan 2002 25268405 Asia 42.129 726.7341
12 Afghanistan 2007 31889923 Asia 43.828 974.5803
23 Albania 2002 3508512 Europe 75.651 4604.2117
24 Albania 2007 3600523 Europe 76.423 5937.0295
35 Algeria 2002 31287142 Africa 70.994 5288.0404
36 Algeria 2007 33333216 Africa 72.301 6223.3675
> head(gapminder[gapminder$year %in% c(2002, 2007),])
country year pop continent lifeExp gdpPercap
11 Afghanistan 2002 25268405 Asia 42.129 726.7341
12 Afghanistan 2007 31889923 Asia 43.828 974.5803
23 Albania 2002 3508512 Europe 75.651 4604.2117
24 Albania 2007 3600523 Europe 76.423 5937.0295
35 Algeria 2002 31287142 Africa 70.994 5288.0404
36 Algeria 2007 33333216 Africa 72.301 6223.3675
# The %in% operator
> 1 %in% c(1, 2, 3, 4, 5)
[1] TRUE
> 6 %in% c(1, 2, 3, 4, 5)
[1] FALSE
SLIDE: 06. Packages
SLIDE: Learning Objectives
SLIDE: Packages
R
When you write your own code, you can distribute it as a package
`installed.packages()`
`install.packages("packagename")` as a string EXPLAIN DEPENDENCIES
RStudio: `Tools` $\rightarrow$ `Install packages...`
RStudio
`update.packages()` DON’T DO THIS - CAN TAKE TIME!
> installed.packages()
Package
BiocInstaller "BiocInstaller"
bit "bit"
bit64 "bit64"
data.table "data.table"
[...]
> install.packages("dplyr")
Installing package into ‘/Users/lpritc/Library/R/3.4/library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘bindrcpp’, ‘glue’, ‘rlang’
[...]
> update.packages(ask=FALSE)
> library(dplyr)
SLIDE: Challenge 14
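When helping learners with package problems, it can be useful to check whether a package is available without loading it or triggering an install. A minimal sketch using `requireNamespace()`; `stats` ships with R, and the second package name is deliberately made up.

```r
# TRUE if the package can be loaded, FALSE otherwise (no error is raised)
has_stats <- requireNamespace("stats", quietly = TRUE)

# A hypothetical name that should not exist on any machine
has_fake <- requireNamespace("notARealPackage12345", quietly = TRUE)

print(c(stats = has_stats, fake = has_fake))
```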
SLIDE: Visualisation is Critical
SLIDE: Learning Objectives
`ggplot2` to generate those plots
SLIDE: The Grammar of Graphics
`ggplot2` package, which is part of the TIDYVERSE, created initially by Hadley Wickham.
`ggplot2` on its own
`ggplot2` implements A SET OF CONCEPTS CALLED THE GRAMMAR OF GRAPHICS
SLIDE: A Basic Scatterplot
`ggplot2` in the SAME WAY YOU’D USE BASE GRAPHICS
`ggplot2` has `qplot()` - the equivalent to `plot()` in base graphics
`plot()` takes `x` and `y` values, and will assign colours to `factor` columns
`qplot()` takes the name of `x` and `y` columns, plus the name of the source `data.frame`, and will assign colours to `factor` columns
> library(ggplot2)
> plot(gapminder$lifeExp, gapminder$gdpPercap, col=gapminder$continent)
> qplot(lifeExp, gdpPercap, data=gapminder, colour=continent)
- `ggplot2` has nicer default styles
- `ggplot2` provides gridlines and legends by default, and the labelling is clearer (no `gapminder$` prefix)
- `ggplot2`!
SLIDE: What is a Plot? *aesthetics*
SLIDE: What is a Plot? *aesthetics*
data.frame
SLIDE: What is a Plot? `geom`s
- `geom`s (short for geometries) DEFINE THE KIND OF PLOT WE PRODUCE
- `geom`s with the **same data and aesthetics**
SLIDE: What is a Plot? `geom`s
- (`gapminder.R`)
- `ggplot()` function.
- `data`, and aesthetics with `aes`
- `geom`
- `geom_point()`
# Generate plot of GDP per capita against life Expectancy
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_point()
COMMIT CHANGES TO SCRIPT
- `geom`?
p + geom_line()
SLIDE: Challenge 15
# Plot life expectancy against time
p <- ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=continent))
p + geom_point()
SLIDE: What is a Plot? *layers*
- `ggplot2` plots are built as layers
- `geom` defining the type of plot
- `ggplot` object describes a **base layer, and can contain data and aesthetics**
SLIDE: What is a Plot? *layers*
- `gapminder`
- `geom_point`
- `geom`
- LAYERS ARE ADDED WITH THE `+` OPERATOR
SLIDE: What is a Plot? *layers*
- `geom` to `geom_line`
# Generate plot of GDP per capita against life expectancy
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_line(aes(group=country))
SLIDE: What is a Plot? *layers*
- `geom`S to produce a more complex plot
- `geom_point()` LAYER WITH `+`
- `alpha` argument to control transparency
# Generate plot of GDP per capita against life expectancy
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_line(aes(group=country)) + geom_point(alpha=0.4)
SLIDE: Challenge 16
# Generate plot of life expectancy against time
p <- ggplot(data=gapminder, aes(x=year, y=lifeExp, color=continent))
p + geom_line(aes(group=country)) + geom_point(alpha=0.35)
SLIDE: Transformations and `scale`s
- `scale` layers
- (`gapminder.R`)
# Generate plot of GDP per capita against life expectancy
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p <- p + geom_line(aes(group=country)) + geom_point(alpha=0.4)
p + scale_y_log10() + scale_color_grey()
SLIDE: Statistics layers
- `geom` layers transform the dataset
# Generate summary plot of GDP per capita against life expectancy
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap))
p + geom_point(alpha=0.4) + scale_y_log10()
# Generate summary plot of GDP per capita against life expectancy
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap))
p <- p + geom_point(alpha=0.4) + scale_y_log10()
p + geom_smooth()
# Generate summary plot of GDP per capita against life expectancy
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap))
p <- p + geom_point(alpha=0.4) + scale_y_log10()
p + geom_density_2d(color="purple")
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap))
p <- p + geom_point(alpha=0.4, aes(color=continent)) + scale_y_log10()
p + geom_density_2d(color="purple")
SLIDE: Multi-panel figures
- `facet_wrap()` layer allows us to make grids of plots, SPLIT BY A FACTOR
# Compare life expectancy over time by country
p <- ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=continent, group=country))
p + geom_line() + scale_y_log10()
- `facet_wrap()` to split by continent is clearer
p <- ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=continent, group=country))
p <- p + geom_line() + scale_y_log10()
p + facet_wrap(~continent)
SLIDE: Challenge 17 (10min)
# Contrast GDP per capita against population
p <- ggplot(data=gapminder, aes(x=pop, y=gdpPercap))
p <- p + geom_point(alpha=0.8, aes(color=continent))
p <- p + scale_y_log10() + scale_x_log10()
p + geom_density_2d(alpha=0.5) + facet_wrap(~year)
`data.frame`s in `dplyr`
SLIDE: Learning Objectives
You’re going to learn to manipulate `data.frame`s with the six verbs of `dplyr`
- `select()`
- `filter()`
- `group_by()`
- `summarize()`
- `mutate()`
- `%>%` (pipe)
SLIDE: What and Why is `dplyr`?
- `dplyr` is a package in the TIDYVERSE; it exists to enable rapid analysis of data by groups
- `gapminder` data by continent, we’d use `dplyr`
SLIDE: Split-Apply-Combine
- The general principle `dplyr` supports is SPLIT-APPLY-COMBINE
- `x`)
- `y` for each group, for example
- `x`
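To make SPLIT-APPLY-COMBINE concrete before introducing the `dplyr` verbs, the three steps can be sketched in base `R`. The `toy` table and its values are invented for illustration; `dplyr` does the equivalent with `group_by()` and `summarize()`.

```r
# Toy data (hypothetical values, not from gapminder)
toy <- data.frame(
  x = c("a", "a", "b", "b", "c"),
  y = c(1, 3, 10, 20, 7)
)

# SPLIT: one vector of y values for each group in x
groups <- split(toy$y, toy$x)

# APPLY: calculate the mean of y within each group
group_means <- sapply(groups, mean)

# COMBINE: gather the per-group results into a single table
result <- data.frame(x = names(group_means), mean_y = unname(group_means))
print(result)
```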
SLIDE: `select()` - Interactive Demo
- `dplyr`
> library(dplyr)
- `select()` verb SELECTS COLUMNS
- `gapminder`
> head(select(gapminder, year, country, gdpPercap))
year country gdpPercap
1 1952 Afghanistan 779.4453
2 1957 Afghanistan 820.8530
3 1962 Afghanistan 853.1007
4 1967 Afghanistan 836.1971
5 1972 Afghanistan 739.9811
6 1977 Afghanistan 786.1134
%>%
> head(gapminder %>% select(year, country, gdpPercap))
year country gdpPercap
1 1952 Afghanistan 779.4453
2 1957 Afghanistan 820.8530
3 1962 Afghanistan 853.1007
4 1967 Afghanistan 836.1971
5 1972 Afghanistan 739.9811
6 1977 Afghanistan 786.1134
SLIDE: `filter()`
- `filter()` selects rows on the basis of some condition, or combination of conditions
> head(filter(gapminder, continent=="Europe"))
country year pop continent lifeExp gdpPercap
1 Albania 1952 1282697 Europe 55.23 1601.056
2 Albania 1957 1476505 Europe 59.28 1942.284
3 Albania 1962 1728137 Europe 64.82 2312.889
4 Albania 1967 1984060 Europe 66.22 2760.197
5 Albania 1972 2263554 Europe 67.69 3313.422
6 Albania 1977 2509048 Europe 68.93 3533.004
- (`gapminder.R`)
- `R` knows that there’s a continuation
- `Run` the lines and check the output in `Environment`
- `Commit` the changes
# Select gdpPercap by country and year, only for Europe
eurodata <- gapminder %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)
SLIDE: Challenge 18
# Select life expectancy by country and year, only for Africa
afrodata <- gapminder %>%
  filter(continent == "Africa") %>%
  select(year, country, lifeExp)
SLIDE: `group_by()`
- `group_by()` verb SPLITS `data.frame`s INTO GROUPS ON A COLUMN PROPERTY
- `tibble` - a table with extra metadata describing the groups in the table
> group_by(gapminder, continent)
# A tibble: 1,704 x 6
# Groups: continent [5]
country year pop continent lifeExp gdpPercap
<fctr> <int> <dbl> <fctr> <dbl> <dbl>
1 Afghanistan 1952 8425333 Asia 28.801 779.4453
2 Afghanistan 1957 9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
4 Afghanistan 1967 11537966 Asia 34.020 836.1971
5 Afghanistan 1972 13079460 Asia 36.088 739.9811
6 Afghanistan 1977 14880372 Asia 38.438 786.1134
7 Afghanistan 1982 12881816 Asia 39.854 978.0114
8 Afghanistan 1987 13867957 Asia 40.822 852.3959
9 Afghanistan 1992 16317921 Asia 41.674 649.3414
10 Afghanistan 1997 22227415 Asia 41.763 635.3414
# ... with 1,694 more rows
SLIDE: `summarize()`
- `group_by()` and `summarize()` is very powerful
- Here, we’ve split the original table into three groups, and now CREATE A NEW VARIABLE `mean_b` THAT IS FILLED BY CALCULATING THE MEAN OF `b`
> # Produce table of mean GDP by continent
> gapminder %>%
+ group_by(continent) %>%
+ summarize(meangdpPercap=mean(gdpPercap))
# A tibble: 5 x 2
continent meangdpPercap
<fctr> <dbl>
1 Africa 2193.755
2 Americas 7136.110
3 Asia 7902.150
4 Europe 14469.476
5 Oceania 18621.609
SLIDE: Challenge 19
# Find average life expectancy by nation
avg_lifexp_country <- gapminder %>%
  group_by(country) %>%
  summarize(meanlifeExp=mean(lifeExp))
> avg_lifexp_country[avg_lifexp_country$meanlifeExp == max(avg_lifexp_country$meanlifeExp),]
# A tibble: 1 x 2
country meanlifeExp
<fctr> <dbl>
1 Iceland 76.51142
> avg_lifexp_country[avg_lifexp_country$meanlifeExp == min(avg_lifexp_country$meanlifeExp),]
# A tibble: 1 x 2
country meanlifeExp
<fctr> <dbl>
1 Sierra Leone 36.76917
SLIDE: `count()` and `n()`
- `summarize()`
- `count()` reports a new table of counts by group
- `n()` is used to represent the count of rows, when calculating new values in `summarize()`
- DEMO IN CONSOLE
- NOTE: standard error is (std dev)/sqrt(n)
> gapminder %>% filter(year == 2002) %>% count(continent, sort = TRUE)
# A tibble: 5 x 2
continent n
<fctr> <int>
1 Africa 52
2 Asia 33
3 Europe 30
4 Americas 25
5 Oceania 2
> gapminder %>% group_by(continent) %>% summarize(se_lifeExp = sd(lifeExp)/sqrt(n()))
# A tibble: 5 x 2
continent se_lifeExp
<fctr> <dbl>
1 Africa 0.3663016
2 Americas 0.5395389
3 Asia 0.5962151
4 Europe 0.2863536
5 Oceania 0.7747759
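If anyone asks what `sd(lifeExp)/sqrt(n())` is doing, the same calculation can be sketched in base `R` on a tiny invented table. `std_error()` is a hypothetical helper name and the values are made up so the arithmetic is easy to check by hand; `length()` plays the role that `n()` plays inside `summarize()`.

```r
# Toy data (hypothetical values, not from gapminder)
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(2, 4, 6, 10, 10, 13)
)

# Hypothetical helper: standard error = (std dev) / sqrt(n)
std_error <- function(v) {
  sd(v) / sqrt(length(v))
}

# One standard error per group
se_by_group <- sapply(split(toy$value, toy$group), std_error)
print(se_by_group)
```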
SLIDE: `mutate()`
- `mutate()` CALCULATES NEW VARIABLES (COLUMNS) ON THE BASIS OF EXISTING COLUMNS
- `gapminder` data, plus an extra column
# Calculate GDP in $billion
gdp_bill <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9)
- `summarize()` command
- `mutate()` in the `summarize()` command
- `Commit` the changes
# Calculate mean/sd of GDP by continent and year
gdp_bycontinents_byyear <- gapminder %>%
  mutate(gdp_billion=gdpPercap*pop/10^9) %>%
  group_by(continent,year) %>%
  summarize(mean_gdpPercap=mean(gdpPercap),
            sd_gdpPercap=sd(gdpPercap),
            mean_gdp_billion=mean(gdp_billion),
            sd_gdp_billion=sd(gdp_billion))
SLIDE: Learning Objectives
- `R`
- `for()` loops
- `R` data analyses, because `dplyr` exists, and because `R` is vectorised
SLIDE: `if()` … `else`
- When this is the case, we can use the general `if()` … `else` structure, which is common to most programming languages
- (`flow_control.R`)
- `Source` the file
- `x > 10` is `FALSE`)
- `if()` block executes if the value in the parentheses evaluates to `TRUE`
# A data point
x <- 8
# Example if statement
if (x > 10) {
print("x is greater than 10")
}
- `else` block
- `Source` the code: we get a message
# Example if statement
if (x > 10) {
  print("x is greater than 10")
} else {
  print("x is less than 10")
}
- `x <- 10` AND TRY AGAIN
- `else if()` STATEMENT
- `Source` the script: NO OUTPUT
# A data point
x <- 10
# Example if statement
if (x > 10) {
  print("x is greater than 10")
} else if (x < 10) {
  print("x is less than 10")
}
- `else` STATEMENT
- `Source` the script: EQUALS output
# A data point
x <- 10
# Example if statement
if (x > 10) {
  print("x is greater than 10")
} else if (x < 10) {
  print("x is less than 10")
} else {
  print("x is equal to 10")
}
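A quick way to show all three branches without editing `x` and re-sourcing each time is to wrap the same `if()` … `else if()` … `else` structure in a function. This is a sketch only; `compare_to_ten()` is a hypothetical name and is not part of `flow_control.R`.

```r
# Hypothetical wrapper around the three-branch structure above
compare_to_ten <- function(x) {
  if (x > 10) {
    message <- "x is greater than 10"
  } else if (x < 10) {
    message <- "x is less than 10"
  } else {
    message <- "x is equal to 10"
  }
  return(message)
}

# Exercise each branch in turn
print(compare_to_ten(12))
print(compare_to_ten(8))
print(compare_to_ten(10))
```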
SLIDE: Challenge 20
# Are there any records for a year
year <- 2002
if(any(gapminder$year == year)){
print("Record(s) for this year found.")
}
SLIDE: `for()` loops
- `for()` loops can be used
- `for()` loops are a very common programming construct
- They express the idea: FOR EACH ITEM IN A GROUP, DO SOMETHING (WITH THAT ITEM)
- (`flow_control.R`)
- `c(1,2,3)`, and we want to print each item
- `for()`, where the argument names a variable (`i`) - the iterator, and a set of values: `for(i in c('a', 'b', 'c'))`
# Basic for loop
for(i in c('a', 'b', 'c')){
  print(i)
}
# Nested loop example
for (i in 1:5) {
  for (j in c('a', 'b', 'c')) {
    print(paste(i, j))
  }
}
- `c()` to append to a vector
# Capture loop output
output <- c()
for (i in 1:5) {
  for (j in c('a', 'b', 'c', 'd', 'e')) {
    output <- c(output, paste(i, j))
  }
}
(output)
# Capture loop output
output_matrix <- matrix(nrow=5, ncol=5)
j_letters <- c('a', 'b', 'c', 'd', 'e')
for (i in 1:5) {
  for (j in 1:5) {
    output_matrix[i, j] <- paste(i, j_letters[j])
  }
}
(output_matrix)
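The matrix example above allocates its output before the loop runs. The same idea applies to vectors, and may be worth a quick aside: the sketch below fills a pre-allocated vector by index rather than growing it with `c()`. The names `n` and `squares` are invented for this illustration.

```r
# Number of results we expect (hypothetical example size)
n <- 5

# Pre-allocate a numeric vector of the right length, filled with zeros
squares <- numeric(n)

# Fill each slot by index instead of appending with c()
for (i in seq_len(n)) {
  squares[i] <- i^2
}
print(squares)
```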
SLIDE: `while()` loops
- `for()` loop
- `runif()` generates random numbers from a uniform distribution
# Example while loop
z <- 1
while(z > 0.1){
  z <- runif(1)
  print(z)
}
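Because `runif()` is random, the loop above runs a different number of times on each machine. If a repeatable demo is wanted, one option is to fix the seed and count the iterations. This is a sketch: the seed value is arbitrary and `n_draws` is a name introduced here, not part of the lesson script.

```r
# Fix the random seed so the demo repeats (the value 42 is arbitrary)
set.seed(42)

z <- 1
n_draws <- 0  # hypothetical counter, added for illustration

# Keep drawing until we see a value of 0.1 or below
while (z > 0.1) {
  z <- runif(1)
  n_draws <- n_draws + 1
}
print(n_draws)
print(z)
```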
SLIDE: Challenge 21
# Challenge solution
for (l in letters) {
if (l %in% c('a', 'e', 'i', 'o', 'u')) {
value <- TRUE
} else {
value <- FALSE
}
print(paste(l, value))
}
SLIDE: Vectorisation
- Although `for()` and `while()` loops can be useful, they are rarely the most efficient way to work in `R`
- `R` ARE VECTORISED
> x <- 1:4
> x
[1] 1 2 3 4
> x * 2
[1] 2 4 6 8
> y <- 6:9
> y
[1] 6 7 8 9
> x + y
[1] 7 9 11 13
> x * y
[1] 6 14 24 36
> x > 2
[1] FALSE FALSE TRUE TRUE
> y < 7
[1] TRUE FALSE FALSE FALSE
> any(y < 7)
[1] TRUE
> all(y < 7)
[1] FALSE
> log(x)
[1] 0.0000000 0.6931472 1.0986123 1.3862944
> x^2
[1] 1 4 9 16
> sin(x)
[1] 0.8414710 0.9092974 0.1411200 -0.7568025
- `*` multiplication operator is a vectorised/elementwise multiplication
- `%*%` operator
> m <- matrix(1:4, nrow = 2, ncol = 2)
> m
[,1] [,2]
[1,] 1 3
[2,] 2 4
> m * m
[,1] [,2]
[1,] 1 9
[2,] 4 16
> m %*% m
[,1] [,2]
[1,] 7 15
[2,] 10 22
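To tie vectorisation back to the loops section, it can help to show the loop and the vectorised form side by side and confirm they agree. A minimal sketch, reusing the `x <- 1:4` example from above; `doubled_loop` and `doubled_vec` are names made up here.

```r
x <- 1:4

# Loop version: visit each element in turn
doubled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- x[i] * 2
}

# Vectorised version: the operator is applied to every element at once
doubled_vec <- x * 2

print(doubled_loop)
print(all(doubled_loop == doubled_vec))
```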
SLIDE: Challenge 22
> v <- 1:10000
> v <- 1/(v^2)
> sum(v)
[1] 1.644834
SLIDE: Learning objectives
- `log()`) and, I hope, have found them useful
- `log()` each time
SLIDE: Why Functions?
We expect functions to have A DEFINED SET OF INPUTS AND OUTPUTS - aids clarity and understanding
FUNCTIONS ARE THE BUILDING BLOCKS OF PROGRAMMING
SLIDE: Defining a Function
<function_name>
- `function` function/keyword to assign the function to `<function_name>`
- `<does_something>`
- `return()` function returns the value, when the function is called
- `functions.R`
- `Source`
# Example function
my_sum <- function(a, b) {
  the_sum <- a + b
  return(the_sum)
}
> my_sum(3, 7)
[1] 10
> a
Error: object 'a' not found
> b
Error: object 'b' not found
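The errors for `a` and `b` show that a function's arguments only exist while it runs. The same is true of variables created inside the function, which can be checked without triggering an error by using `exists()`. A small sketch, assuming a fresh session in which no variable called `the_sum` has been defined at the console:

```r
# Same function as in the demo
my_sum <- function(a, b) {
  the_sum <- a + b   # the_sum is local to the function
  return(the_sum)
}

result <- my_sum(3, 7)
print(result)

# the_sum was only ever created inside the function call,
# so it is not visible from where we called my_sum()
print(exists("the_sum"))
```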
# Fahrenheit to Kelvin
fahr_to_kelvin <- function(temp) {
kelvin <- ((temp - 32) * (5 / 9)) + 273.15
return(kelvin)
}
> fahr_to_kelvin(32)
[1] 273.15
> fahr_to_kelvin(-40)
[1] 233.15
> fahr_to_kelvin(212)
[1] 373.15
> temp
Error: object 'temp' not found
- `Source` the script
# Kelvin to Celsius
kelvin_to_celsius <- function(temp) {
celsius <- temp - 273.15
return(celsius)
}
> kelvin_to_celsius(273.15)
[1] 0
> kelvin_to_celsius(233.15)
[1] -40
> kelvin_to_celsius(373.15)
[1] 100
> fahr_to_kelvin(212)
[1] 373.15
> kelvin_to_celsius(fahr_to_kelvin(212))
[1] 100
# Fahrenheit to Celsius
fahr_to_celsius <- function(temp) {
celsius <- kelvin_to_celsius(fahr_to_kelvin(temp))
return(celsius)
}
- `R`’s VECTORISATION
> fahr_to_celsius(212)
[1] 100
> fahr_to_celsius(32)
[1] 0
> fahr_to_celsius(-40)
[1] -40
> fahr_to_celsius(c(-40, 32, 212))
[1] -40 0 100
SLIDE: Documentation
- But it’s not a detailed explanation
- `R`’s help useful, but it doesn’t exist for your functions until you write it
- YOUR FUTURE SELF WILL THANK YOU FOR DOING IT!
> ?fahr_to_celsius
No documentation for ‘fahr_to_celsius’ in specified packages and libraries:
you could try ‘??fahr_to_celsius’
> ??fahr_to_celsius
# Fahrenheit to Celsius
fahr_to_celsius <- function(temp) {
# Convert input temperature from fahrenheit to celsius scale
#
# temp - numeric
#
# Example:
# > fahr_to_celsius(c(-40, 32, 212))
# [1] -40 0 100
celsius <- kelvin_to_celsius(fahr_to_kelvin(temp))
return(celsius)
}
> fahr_to_celsius
function(temp) {
# Convert input temperature from fahrenheit to celsius scale
#
# temp - numeric
#
# Example:
# > fahr_to_celsius(c(-40, 32, 212))
# [1] -40 0 100
celsius <- kelvin_to_celsius(fahr_to_kelvin(temp))
return(celsius)
}
SLIDE: Function Arguments
- DEMO IN SCRIPT (`functions.R`)
- `Source` script
# Calculate total GDP in gapminder data
calcGDP <- function(data) {
  # Returns the gapminder data with additional column of total GDP
  #
  # data - gapminder dataframe
  #
  # Example:
  # gapminderGDP <- calcGDP(gapminder)
  gdp <- data %>% mutate(gdp=pop * gdpPercap)
  return(gdp)
}
> calcGDP(gapminder)
Error in data %>% mutate(gdp = pop * gdpPercap) :
could not find function "%>%"
- `functions.R` file doesn’t know about `dplyr`
- `require()` function
- (`functions.R`)
- `require()` calls at the top of your script
- `Source` script
require(dplyr)
> head(calcGDP(gapminder))
country year pop continent lifeExp gdpPercap gdp
1 Afghanistan 1952 8425333 Asia 28.801 779.4453 6567086330
2 Afghanistan 1957 9240934 Asia 30.332 820.8530 7585448670
3 Afghanistan 1962 10267083 Asia 31.997 853.1007 8758855797
4 Afghanistan 1967 11537966 Asia 34.020 836.1971 9648014150
5 Afghanistan 1972 13079460 Asia 36.088 739.9811 9678553274
6 Afghanistan 1977 14880372 Asia 38.438 786.1134 11697659231
- `gapminder` data - but what if we want to get the data by year?
- (`functions.R`)
- `Source` script
> source('~/Desktop/swc-r-lesson/scripts/functions.R')
> head(calcGDP(gapminder, 2002))
country year pop continent lifeExp gdpPercap gdp
1 Afghanistan 2002 25268405 Asia 42.129 726.7341 18363410424
2 Albania 2002 3508512 Europe 75.651 4604.2117 16153932130
3 Algeria 2002 31287142 Africa 70.994 5288.0404 165447670333
4 Angola 2002 10866106 Africa 41.003 2773.2873 30134833901
5 Argentina 2002 38331121 Americas 74.340 8797.6407 337223430800
6 Australia 2002 19546792 Oceania 80.370 30687.7547 599847158654
> head(calcGDP(gapminder, c(1997, 2002)))
country year pop continent lifeExp gdpPercap gdp
1 Afghanistan 1997 22227415 Asia 41.763 635.3414 14121995875
2 Afghanistan 2002 25268405 Asia 42.129 726.7341 18363410424
3 Albania 1997 3428038 Europe 72.950 3193.0546 10945912519
4 Albania 2002 3508512 Europe 75.651 4604.2117 16153932130
5 Algeria 1997 29072015 Africa 69.152 4797.2951 139467033682
6 Algeria 2002 31287142 Africa 70.994 5288.0404 165447670333
> head(calcGDP(gapminder))
Error in filter_impl(.data, quo) :
Evaluation error: argument "year_in" is missing, with no default.
- `NULL`)
- `Source` script
# Calculate total GDP in gapminder data
calcGDP <- function(data, year_in=NULL) {
  # Returns the gapminder data with additional column of total GDP
  #
  # data - gapminder dataframe
  # year_in - year(s) to report data
  #
  # Example:
  # gapminderGDP <- calcGDP(gapminder)
  gdp <- data %>% mutate(gdp=(pop * gdpPercap))
  if (!is.null(year_in)) {
    gdp <- gdp %>% filter(year %in% year_in)
  }
  return(gdp)
}
> source('~/Desktop/swc-r-lesson/scripts/functions.R')
> head(calcGDP(gapminder))
[1] country year pop continent lifeExp gdpPercap gdp
<0 rows> (or 0-length row.names)
> head(calcGDP(gapminder))
country year pop continent lifeExp gdpPercap gdp
1 Afghanistan 1952 8425333 Asia 28.801 779.4453 6567086330
2 Afghanistan 1957 9240934 Asia 30.332 820.8530 7585448670
3 Afghanistan 1962 10267083 Asia 31.997 853.1007 8758855797
4 Afghanistan 1967 11537966 Asia 34.020 836.1971 9648014150
5 Afghanistan 1972 13079460 Asia 36.088 739.9811 9678553274
6 Afghanistan 1977 14880372 Asia 38.438 786.1134 11697659231
> head(calcGDP(gapminder, year_in=2002))
country year pop continent lifeExp gdpPercap gdp
1 Afghanistan 2002 25268405 Asia 42.129 726.7341 18363410424
2 Albania 2002 3508512 Europe 75.651 4604.2117 16153932130
3 Algeria 2002 31287142 Africa 70.994 5288.0404 165447670333
4 Angola 2002 10866106 Africa 41.003 2773.2873 30134833901
5 Argentina 2002 38331121 Americas 74.340 8797.6407 337223430800
6 Australia 2002 19546792 Oceania 80.370 30687.7547 599847158654
- `Source` script
# Calculate total GDP in gapminder data
calcGDP <- function(data, year_in=NULL, country_in=NULL) {
  # Returns the gapminder data with additional column of total GDP
  #
  # data - gapminder dataframe
  # year_in - year(s) to report data
  # country_in - country(ies) to report data
  #
  # Example:
  # gapminderGDP <- calcGDP(gapminder)
  gdp <- data %>% mutate(gdp=(pop * gdpPercap))
  if (!is.null(year_in)) {
    gdp <- gdp %>% filter(year %in% year_in)
  }
  if (!is.null(country_in)) {
    gdp <- gdp %>% filter(country %in% country_in)
  }
  return(gdp)
}
> source('~/Desktop/swc-r-lesson/scripts/functions.R')
> head(calcGDP(gapminder))
country year pop continent lifeExp gdpPercap gdp
1 Afghanistan 1952 8425333 Asia 28.801 779.4453 6567086330
2 Afghanistan 1957 9240934 Asia 30.332 820.8530 7585448670
3 Afghanistan 1962 10267083 Asia 31.997 853.1007 8758855797
4 Afghanistan 1967 11537966 Asia 34.020 836.1971 9648014150
5 Afghanistan 1972 13079460 Asia 36.088 739.9811 9678553274
6 Afghanistan 1977 14880372 Asia 38.438 786.1134 11697659231
> head(calcGDP(gapminder, 1957))
country year pop continent lifeExp gdpPercap gdp
1 Afghanistan 1957 9240934 Asia 30.332 820.853 7585448670
2 Albania 1957 1476505 Europe 59.280 1942.284 2867792398
3 Algeria 1957 10270856 Africa 45.685 3013.976 30956113720
4 Angola 1957 4561361 Africa 31.999 3827.940 17460618347
5 Argentina 1957 19610538 Americas 64.399 6856.856 134466639306
6 Australia 1957 9712569 Oceania 70.330 10949.650 106349227169
> head(calcGDP(gapminder, 1957, "Egypt"))
country year pop continent lifeExp gdpPercap gdp
1 Egypt 1957 25009741 Africa 44.444 1458.915 36487093094
> head(calcGDP(gapminder, "Egypt"))
[1] country year pop continent lifeExp gdpPercap gdp
<0 rows> (or 0-length row.names)
> head(calcGDP(gapminder, country_in="Egypt"))
country year pop continent lifeExp gdpPercap gdp
1 Egypt 1952 22223309 Africa 41.893 1418.822 31530929611
2 Egypt 1957 25009741 Africa 44.444 1458.915 36487093094
3 Egypt 1962 28173309 Africa 46.992 1693.336 47706874227
4 Egypt 1967 31681188 Africa 49.293 1814.881 57497577541
5 Egypt 1972 34807417 Africa 51.137 2024.008 70450495584
6 Egypt 1977 38783863 Africa 53.319 2785.494 108032201472
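The zero-row result from `calcGDP(gapminder, "Egypt")` comes from positional matching: the second unnamed argument is matched to `year_in`, not `country_in`. A small stand-alone sketch can make this visible without the `gapminder` data. `describe_filters()` is a hypothetical function that only mirrors the argument list of `calcGDP()` and reports what each argument received.

```r
# Hypothetical function with the same argument list as calcGDP();
# it only reports which argument received which value
describe_filters <- function(data, year_in = NULL, country_in = NULL) {
  year_label <- if (is.null(year_in)) "NULL" else as.character(year_in)
  country_label <- if (is.null(country_in)) "NULL" else as.character(country_in)
  paste0("year_in=", year_label, ", country_in=", country_label)
}

# Unnamed: "Egypt" is matched by position to year_in
positional <- describe_filters(NULL, "Egypt")
print(positional)

# Named: "Egypt" goes to country_in, and year_in keeps its default
named <- describe_filters(NULL, country_in = "Egypt")
print(named)
```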
# Plot grid of country life expectancy
plotLifeExp <- function(data, letter=letters, wrap=FALSE) {
  # Return ggplot2 chart of life expectancy against year
  #
  # data - gapminder dataframe
  # letter - start letters for countries
  # wrap - logical: wrap graphs by country
  #
  # Example:
  # > plotLifeExp(gapminder, c('A', 'Z'), wrap=TRUE)
  starts.with <- substr(data$country, start = 1, stop = 1)
  az.countries <- data[starts.with %in% letter, ]
  p <- ggplot(az.countries, aes(x=year, y=lifeExp, colour=country))
  p <- p + geom_line()
  if (wrap) {
    p <- p + facet_wrap(~country)
  }
  return(p)
}
SLIDE: Learning Objectives
SLIDE: Literate Programming
RStudio
SLIDE: Create an `R Markdown` file
- `R`, literate programming is **implemented in `R Markdown` files**
- `File` $\rightarrow$ `New File` $\rightarrow$ `R Markdown`
- (`Literate Programming`)
- (`Ctrl-S`) - create new subdirectory (`markdown`) - `literate_programming.Rmd`
- `.Rmd`
SLIDE: Components of an `R Markdown` file
- `---`
---
title: "Literate Programming"
author: "Leighton Pritchard"
date: "04/12/2017"
output: html_document
---
- `#`, ASTERISKS `*` AND ANGLED BRACKETS `<>`
- `R` code runs in the document, and is fenced by backticks
- KNIT
- `R` code and output
- KNIT TO PDF
- `.pdf` document opens in a new window
- KNIT TO WORD
- `Word` document opens up
SLIDE: Creating a Report
- We’ll create a report on the `gapminder` data
- DELETE THE EXISTING TEXT/CODE CHUNKS (`literate_programming.Rmd`)
- (`Life Expectancies`)
- `setup` section
- `setup` section is run, but not shown (knit to demo)
- `include = FALSE`
- `#`
- `R` to name the data used
- `setup`
- (`Life expectancy in countries`)
- `Source` the `functions.R` file to get our solution to Challenge 23 (`plotLifeExp`)
- `{r echo=FALSE}` shows output but not the code
---
title: "Life Expectancies"
author: "Leighton Pritchard"
date: "04/12/2017"
output:
  pdf_document:
    toc: true
    number_sections: true
  html_document:
    toc: true
    toc_float: true
    number_sections: true
  word_document:
    toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Path to gapminder data
datapath <- "../data/gapminder-FiveYearData.csv"
# Letters to report on
az <- c('G', 'Y', 'R')
# Load gapminder data
gapminder <- read.csv(datapath, sep=",", header=TRUE)
# Source functions from earlier lesson
source("../scripts/functions.R")
```
We will present the life expectancies over time in a set of countries, using the gapminder data in the file `r datapath`.
We will specifically focus on countries beginning with the letters: `r az`.
`r az` countries
In countries starting with these letters, the life expectancy is as plotted below.
We use the code from our earlier challenge solution
```{r plot_function}
plotLifeExp
```
```{r echo=FALSE}
plotLifeExp(gapminder, az, wrap=TRUE)
```