Nicolas Kosinski's notes from the Coursera course "Computing for Data Analysis", which teaches R basics for statistics: https://www.coursera.org/course/compdata (session #4: https://class.coursera.org/compdata-004)
✔ R installation @done (14-01-21 10:59)
✔ Week 1 @done (14-01-22 07:55)
✔ What Makes R Different? (4:20) @done (14-01-21 12:08)
R mixes:
interactive (command-oriented) tool
programming language
✔ How to Get Help (13:53) @done (14-01-21 12:37)
✔ Background and Overview (16:38) @done (14-01-21 15:39)
background:
dialect of S (interactive language)
S created in 1976, several versions
R created in 1991, current version: 3, S-like syntax
pros:
free (GPL)
quite lean and modular
excellent graphical capabilities
active community
cons:
old technology (40 years old!)
little support on 3D graphics
limited by in-memory storage (objects must fit in RAM)
overview:
"base" packages (utils, stats, datasets etc...)
"recommended" packages (boot, class, etc...)
source: CRAN (official) / Bioconductor / personal websites...
✔ Data Types (31:06) @done (14-01-21 17:27)
5 base types: character, numeric, integer, logical and complex
vector: contains objects of same type
list: contains objects of any types
objects have attributes
assignment: <-
comment: #
range (sequence): 1:10
special values: NA (missing value), NaN (not a number; a NaN is also NA), Inf (infinity)
explicit coercion: "as.logical(0)"
vector:
created by
vector(mode, length) # e.g. vector("numeric", 10)
or
c(...) # concatenation, objects are converted to compatible type (i.e. c(1.2, "a") creates string vector)
matrix:
special kind of vector with dimension
matrix(1:6, nrow=2, ncol=3) # values fill column by column
dimension can be set after value assignment
m <- c(1:10)
dim(m) <- c(2, 5)
create via column or row binding
cbind(1:3, 4:6) # binds as columns: a 3x2 matrix with columns 1 2 3 and 4 5 6
rbind(1:3, 4:6) # binds as rows: a 2x3 matrix with rows 1 2 3 and 4 5 6
list:
are indexed
factor:
special kind of list
used for categories (enums) => index + label + count
treated with modeling functions lm() and glm()
counts occurrences (kind of map):
f <- factor(c("yes", "no", "yes"))
table(f) # no yes
# 1 2
levels(f) # "no" "yes" (alphabetical order by default)
treated as integers:
unclass(f) # 2 1 2
order can be chosen via the constructor: factor(c("yes", "no", "yes"), levels=c("yes", "no"))
missing values:
is.na(f)
is.nan(f)
data frames:
for tabular data: a list containing heterogeneous columns, plus row names
can be converted to matrix (=> coercion)
naming:
Object can have a name (meta-data)
# for objects
x <- 1:3
names(x) <- c("one", "two", "three")
# for matrices
m <- matrix(1:4, nrow=2, ncol=2)
dimnames(m) <- list(c("row1", "row2"), c("col1", "col2")) # row names first, then column names
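The coercion rules mentioned above can be sketched quickly (a small example; values invented, not from the course):

```r
# implicit coercion: mixed types collapse to the most general type
v <- c(1.2, "a")
class(v)              # "character"

# explicit coercion with the as.* functions
as.logical(0)         # FALSE
as.numeric("3.14")    # 3.14
as.numeric("abc")     # NA, with a warning: coercion can fail
```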
✔ Subsetting (17:20) @done (14-01-21 18:18)
[*index or condition*] returns a subset with same type as original
[[]] returns a subset for multi-type containers (lists, data frames)
$ extracts elements by name (same semantics as [[]])
subsetting vectors:
v <- c("a", "b", "c", "c")
# by index:
v[1] # "a"
# filtering:
v[v > "a"] # "b", "c", "c"
subsetting matrices:
m <- matrix(1:6, 2, 3)
# by indices, into single-element vector:
m[1, 2] # 3
# by indices with missing values:
m[1,] # 1 3 5
# one element into an 1x1 matrix:
m[1, 2, drop = FALSE] # a 1x1 matrix containing 3
subsetting lists:
l <- list(ints = c(1, 2, 3), doubles = c(1.1, 2.2, 3.3))
# by name:
l$ints # 1, 2, 3
l[["ints"]] # 1, 2, 3
# by name via variable:
var <- "ints"
l[[var]] # 1, 2, 3
# by name via partial matching:
l$i # 1, 2, 3 ($ allows partial matching)
l[["i", exact = FALSE]] # 1, 2, 3
l[["i"]] # NULL ([[ requires the exact name by default)
# by index:
l[[c(1, 3)]] # 3
filtering NA values:
via [cond] or complete.cases() (vectors and data frames)
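The two NA-filtering approaches above can be sketched as follows (values invented for illustration):

```r
v <- c(1, NA, 3, NA, 5)
# via a logical condition:
v[!is.na(v)]              # 1 3 5

# via complete.cases(), which also works row-wise on data frames:
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
df[complete.cases(df), ]  # keeps only the first row (no NA in any column)
```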
✔ Vectorized Operations (3:46) @done (14-01-21 18:46)
many functions can be applied in parallel on objects (vectors, etc...) without looping
v1 <- 1:4
v2 <- 6:9
v3 <- v1 + v2 # 7 9 11 13
v3 < 10 # T, T, F, F
matrix product:
m1 <- matrix(1:4, 2, 2)
m2 <- matrix(rep(10, 4), 2, 2)
m1 %*% m2 # | 40 40 | | 1 3 | | 10 10 |
# | 60 60 | = | 2 4 | . | 10 10 |
# * would mean multiply element-by-element
✔ Reading/Writing Data: Part 1 @done (14-01-21 20:12)
read/write functions for CSV-like text files:
read.table()/write.table() # default separator is whitespace, default comment char is "#"; can have a header
read.csv() # similar to read.table but default separator is "," and header = TRUE by default
readLines()/writeLines()
dget()/dput() # read/write single object
source()/dump() # read/write multiple objects
load()/save() # read/write in workspace
serialize()/unserialize() # read/write binary inputs
Reading large dataset:
refer to help page (hints)
consider skipping comments (comment.char="")
help type detection via "colClasses" argument:
# hard-code the column types:
all <- read.table(file, colClasses = "numeric")
# detect the types from the first n rows:
start <- read.table(file, nrows = 10)
classes <- sapply(start, class)
all <- read.table(file, colClasses = classes)
set row count (exact or over-estimate):
all <- read.table(file, nrows = n)
estimate required memory:
number of cells * 8 bytes (per numeric) = estimated memory in bytes (optimistic lower bound)
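Worked example of the memory estimate above (the row and column counts are made up):

```r
rows <- 1500000
cols <- 120
bytes <- rows * cols * 8   # 8 bytes per numeric cell
bytes / 2^30               # about 1.34 GB, an optimistic lower bound
```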
✔ Reading/Writing Data: Part 2 (9:25) @done (14-01-21 20:18)
Text file with metadata:
pros: human-readable, Unix philosophy, versioning...
cons: space-hungry (not storage efficient)
basic functions:
d <- data.frame(a=1, b="a")
d2 <- data.frame(a=2, b="c")
# serialize object:
dput(d)
# write object to a file:
dput(d, file = "d.R")
# read file to object:
d2 <- dget("d.R")
# write objects by name to a file:
dump(c("d", "d2"), file="d_and_d2.R")
# read objects from a file:
rm(d, d2) # we can remove objects to be sure...
source("d_and_d2.R")
file reading via connection interfaces:
file
url
gzfile # open GZIP file
bzfile # open BZIP2 file
some functions such as read.csv(filename) hide the connection interface (i.e. no need to open and close the file)
connection interface may be used for partial reading (example: readLines(cnx, 10))
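A minimal sketch of partial reading through a connection (file name and contents invented):

```r
# write a small temporary file
tf <- tempfile()
writeLines(c("line 1", "line 2", "line 3"), tf)

# open a connection, read only the first two lines, close it
con <- file(tf, "r")
first_two <- readLines(con, 2)
close(con)
first_two   # "line 1" "line 2"
```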
✔ Setting Your Working Directory and Editing R Code (Windows) [7:20] @done (14-01-22 07:44)
Basic commands related to working directory:
# work dir can be changed via menu "File | Change dir"
getwd() # show working directory, used as a base path for loading files
dir() # list files from work dir
source("foo.R") # load code from script file
ls() # lists functions and variables
dev loop : edit script, save it and load it in R Console via source("filename")
✔ The str function (6:05) @done (14-01-22 07:55)
str() displays the internal structure of an object
"what's in this object?"
many usages such as:
summarizing nested objects (vectors, lists, etc...)
displaying function signature
✔ Week 2 @done (14-01-23 08:55)
✔ Control Structures (15:23) @done (14-01-22 08:42)
control-flow keywords (for scripts):
if/else
for
while
repeat # infinite loop (until break is called)
break # break loop
next # go to next loop iteration
return
NB: for interactive usage, *apply functions are more useful
conditional assignment:
y <- if (x > 2) { 1 } else { 0 }
for (details):
# on range:
for (i in 1:4) {} # beware not to overwrite an existing variable!
# on vector indexes:
for (i in seq_along(v)) { v[i] }
# on vector values:
for (elem in v) { elem }
# on matrix:
m <- matrix(...)
for (i in seq_len(nrow(m))) { for (j in seq_len(ncol(m))) { m[i, j] } } # nested loops
combining expressions:
evaluated left to right
via logical operators (&&, ...)
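Left-to-right evaluation with short-circuiting can be checked directly (a small sketch, values invented):

```r
x <- 5
# the right-hand side only runs because the left-hand side is TRUE:
x > 0 && log(x) > 1              # TRUE

# short-circuit: stop() on the right is never evaluated:
FALSE && stop("never reached")   # FALSE
```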
✔ Functions (16.32) @done (14-01-22 09:50)
first-class objects:
arguments can be functions
functions can be nested
definition:
# creates an object of class "function"
f <- function(foo, bar, bazbaz = "default") { # ... }
returned value is last expression
arguments:
are named
can have default value
f <- function(nums, best.effort = TRUE)
can be explicit defined ("formal arguments")
formals(f) returns formal arguments
matching (on caller side):
# 1. by name:
f(foo=1, bar=2, bazbaz=3)
# 2. by name with partial matching (for interactive usage, preferably):
f(fo=1, bar=2, baz=3)
# 3. by position:
f(1, 2, 3)
# mixing position and name:
# legal but beware (named args are matched first and do not count for position)
f(bazbaz=3, 1, 2)
are evaluated lazily:
=> missing arg error occurs when arg is evaluated
variables arguments:
usages
extending an existing function
function (x, ...) { otherFunction(...) }
generic functions with extra arguments => function dispatch
unknown number of args
should be used for first arg
other args must be matched by exact name (partial matching is ignored)
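A small sketch of ... in practice (the function names are invented for illustration):

```r
# forward extra arguments to another function:
trimmed_mean <- function(x, ...) mean(x, ...)
trimmed_mean(c(1, 2, 3, 100), trim = 0.25)   # 2.5

# take an unknown number of arguments:
count_args <- function(...) length(list(...))
count_args(1, "a", TRUE)                     # 3
```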
✔ Scoping Rules (19:03) @done (14-01-22 11:26)
Bind symbol to value:
Via several environments:
1. global env # always first
2. package X # libraries loaded by user via library() are inserted here, by default
3. package Y
...
last is Base package # always last
Environments are ordered: search()
Function names != object names:
an object "foo" and a function "foo" can coexist
Lexical (=static) scoping:
differs from dynamic scoping
environment:
symbol dictionary (map names to symbols)
parent
children
foo <- function() {}
environment(foo) # outputs the global env, for instance
closure:
closure = function + environment
free variable:
used in a function but not defined in it (neither a local variable nor an argument)
foo <- function(a) {
# b is a free variable
a + b
}
free vars are searched in the function's environment (call it e), then in e's parent, etc... until the last parent (usually the global env)
if not found => error
nested functions:
functions can be returned by other functions
they have a dedicated environment
example:
# declare and call nested function:
make.power <- function(n) {
pow <- function(x) {
x ^ n
}
pow
}
cube <- make.power(3)
cube(4)
# display content of cube's env:
ls(environment(cube))
[1] "n" "pow"
# display symbol bound to "n", for cube():
get("n", environment(cube))
[1] 3
Lexical scoping:
Free variables are searched in function definition's env
Example:
y <- 10
f <- function(x) {
y <- 2
y^2 + g(x)
}
g <- function(x) {
# y is bound to 10 (scope of the function definition)
# and not to 2 (scope of function call)
x * y
}
f(3)
[1] 34
Lexical scoping is also used in languages such as Python, Perl, Scheme, Common Lisp
Consequence: memory cost (to store all environments)
✔ Optimization Application (9:21) @done (14-01-22 12:01)
Lexical scoping suits optimization problems well
Optimization routines: optim(), nlm(), optimize(), etc...
Objective functions are implemented via a "constructor" function that returns a nested function
By default, objective functions are minimized (not maximized)
# Set sigma (standard deviation) and mu (mean)
optim(c(mu = 0, sigma = 1), nLL)$par
# Set sigma to 2
nLL <- make.NegLogLike(normals, c(FALSE, 2))
optimize(nLL, c(-1, 3))$minimum
# Set mu to 1
nLL <- make.NegLogLike(normals, c(1, FALSE))
optimize(nLL, c(-1, 3))$minimum
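The snippets above call make.NegLogLike without defining it. A sketch of such a constructor (a reconstruction, not verbatim from the course: it assumes a Normal likelihood, with 'fixed' marking the (mu, sigma) parameters to hold constant):

```r
make.NegLogLike <- function(data, fixed = c(FALSE, FALSE)) {
  params <- fixed
  function(p) {
    params[!fixed] <- p
    mu <- params[1]
    sigma <- params[2]
    # log-density of N(mu, sigma^2), summed over the data, then negated
    a <- -0.5 * length(data) * log(2 * pi * sigma^2)
    b <- -0.5 * sum((data - mu)^2) / sigma^2
    -(a + b)
  }
}

set.seed(1)
normals <- rnorm(100, mean = 1, sd = 2)
# with sigma fixed to 2, the minimum should land near mean(normals):
nLL <- make.NegLogLike(normals, c(FALSE, 2))
optimize(nLL, c(-1, 3))$minimum
```

The closure keeps `data` and `fixed` in its environment, which is exactly why lexical scoping suits this pattern.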
✔ (Loop function:) lapply (9:23) @done (14-01-22 13:18)
lapply:
lapply() applies a function to all elements of a list, returning a list (same size as input)
lapply(object, function, ...)
If object is not a list, it will be coerced to a list (may fail)
Examples:
# sum integers:
x <- list(a = 1:5, b = 5:10)
lapply(x, mean)
$a
[1] 3
$b
[1] 7.5
# generate random numbers from 1 to 6 demonstrating extra args (...):
n <- 1:4
lapply(n, runif, min = 1, max = 6)
[[1]]
[1] 3.537174
[[2]]
[1] 1.221711 1.470035
[[3]]
[1] 2.304671 5.510663 4.682575
# using anonymous function:
l <- 1:2
lapply(l, function(x) x+1)
[[1]]
[1] 2
[[2]]
[1] 3
sapply() is similar but the returned object is simplified:
returns a vector instead of a list if every element has length 1
returns a matrix if every element is a vector of the same length (> 1)
Examples:
# vector returned:
x <- list(a = 1:5, b = 5:10)
sapply(x, mean)
a b
3.0 7.5
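The matrix-simplification case can be sketched too (example data invented):

```r
x <- list(a = 1:4, b = 5:8)
# range() returns 2 values per element => sapply simplifies to a 2x2 matrix
sapply(x, range)
#      a b
# [1,] 1 5
# [2,] 4 8
```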
✔ (Loop function:) apply (7:16) @done (14-01-22 14:14)
apply(array, dimensionToRetain, function, ...)
Example:
# Sum the columns (=> MARGIN 2) of a matrix with 5 rows and 3 columns
m <- matrix(1:15, 5, 3)
m
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 2 7 12
[3,] 3 8 13
[4,] 4 9 14
[5,] 5 10 15
apply(m, 2, sum)
[1] 15 40 65
shortcuts:
rowSums(x) = apply(x, 1, sum)
rowMeans(x) = apply(x, 1, mean)
colSums(x) = apply(x, 2, sum)
colMeans(x) = apply(x, 2, mean)
Optimized for large matrices
on array:
# 3-dim array:
a <- array(1:2*2*5, c(2, 2, 5))
# sum over dimension 3 (retaining dimensions 1 and 2):
apply(a, c(1, 2), sum)
[,1] [,2]
[1,] 50 50
[2,] 100 100
# equivalent to:
rowSums(a, dims = 2)
✔ (loop function:) tapply and split (12:22) @done (14-01-22 15:02)
tapply:
apply a function to a subset of a vector (kind of "group by")
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
X: vector
INDEX: factor or list of factors
FUN: function to apply
simplify: should the result be simplified? (if simplify, return a vector, otherwise a list)
Example:
# vector with
# 10 random values with normal distribution with mean = 0
# 10 random values with uniform distribution
# 10 random values with normal distribution with mean = 1
v <- c(rnorm(10), runif(10), rnorm(10, 1))
# generate a factor with 3 levels, each repeated 10 times
f <- gl(3, 10)
tapply(v, f, mean)
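A deterministic sketch may be easier to check than the random example above (values invented):

```r
v <- c(1, 2, 3, 10, 20, 30)
f <- gl(2, 3)        # factor: 1 1 1 2 2 2
tapply(v, f, mean)   # group "1" -> mean(1, 2, 3) = 2, group "2" -> mean(10, 20, 30) = 20
#  1  2
#  2 20
```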
split:
Splits a vector into several vectors using factors (groups).
Similar to tapply, but does not apply a function.
Is generally used in conjunction with lapply...
Examples:
# split on 1 level:
# vector with
# 10 random values with normal distribution with mean = 0
# 10 random values with uniform distribution
# 10 random values with normal distribution with mean = 1
v <- c(rnorm(10), runif(10), rnorm(10, 1))
# generate a factor with 3 levels, each repeated 10 times
f <- gl(3, 10)
lapply(split(v, f), sum)
# split on multiple levels (factors):
x <- rnorm(10)
f1 <- gl(2, 5)
f2 <- gl(5, 2)
# for information, the combination of factors is given by:
interaction(f1, f2)
str(split(x, list(f1, f2), drop=TRUE))
List of 6
$ 1.1: num [1:2] -0.37 -0.029
$ 1.2: num [1:2] 0.00941 -0.85438
$ 1.3: num -0.331
$ 2.3: num -1.07
$ 2.4: num [1:2] -0.165 -0.825
$ 2.5: num [1:2] -0.224 -1.61
✔ (loop function:) mapply (4:41) @done (14-01-22 15:13)
mapply() applies a function in parallel over arguments
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
Useful to "vectorize" a function that does not accept vector arguments.
Example:
# repeat 1 three times
> rep(1, 3)
[1] 1 1 1
# repeat 1 four times, 2 three times, etc...
> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
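Sketch of "vectorizing" a scalar-only function with mapply (clamp is an invented helper):

```r
# clamp() only works on a single value because max()/min() collapse vectors
clamp <- function(x, lo, hi) max(lo, min(x, hi))

# mapply applies it element by element:
mapply(clamp, c(-5, 3, 99), lo = 0, hi = 10)   # 0 3 10
```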
✔ Debugging Tools: Part 1 (8:50) @done (14-01-22 15:34)
something goes wrong!
Kind of messages:
1. message: diagnostic/notification; function execution continues
2. warning: something unexpected happened; function execution continues
example: log(-1) => NaNs produced
3. error: function execution stops
example: if (x > 0) throws an error if x is NA
+ condition: generic type of issue (custom)
invisible(v) can be used in function to indicate that object is returned but should not be auto-printed to the console
✔ Debugging Tools: Part 2 (10:07) @done (14-01-22 15:52)
debugging functions:
traceback(): prints out call stack of last error
debug(f): flags a function for debug mode (breakpoint is created on first line)
browser(): suspends function execution so it can be inspected step by step (can be called anywhere in the function)
trace(): inserts debugging code into a function (useful for package code we did not write)
recover(): modifies the error behaviour in order to browse the call stack
✔ Debugging Tools: Part 3 (8:23) @done (14-01-22 16:08)
traceback example (dumb one!):
> mean(z)
Error in mean(z) : object 'z' not found
> traceback()
1: mean(z)
debug example:
debug(myFunction)
myFunction("foo")
# then type 'n' + return for going to next line
recover example:
options(error = recover)
read.csv("doesNotExist")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message
In file(file, "rt")
cannot open file 'doesNotExist': No such file or directory
Enter a frame number, or 0 to exit
1: read.csv("doesNotExist")
2: read.table(file = file, header = header, sep = sep, quote = quote, dec = de
3: file(file, "rt")
✔ Week 3 @done (14-01-28 16:06)
✔ Week 3 Introduction (1:58) @done (14-01-23 14:11)
✔ Simulation (14:51) @done (14-01-23 15:47)
generating numbers:
function prefixes:
d: evaluation of density
r: generation of random numbers
p: evaluation of cumulative distribution
q: evaluation of the quantile
main functions:
rnorm() generates random number for Normal distribution (input: number of points, mean (def=0) and standard deviation (def=1))
dnorm() evaluates Normal probability density at point (or some points)
pnorm() evaluates the cumulative distribution for a Normal distribution
qnorm() evaluates the quantile for a Normal distribution qnorm(p) = pnorm^-1(p)
N.B.: lower tail: left part vs upper tail: right part
rpois() generates random numbers for Poisson distribution (input: number of points and rate, output: integers)
ppois() evaluates the cumulative distribution for a Poisson distribution (input: value-inferior-or-equal-to and rate, output: probability)
rbinom() generates binary numbers
Set the seed:
set.seed(x)
Always call it (before the first random call) in order to ensure reproducibility.
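Reproducibility via set.seed() can be verified directly (the seed value is arbitrary):

```r
set.seed(10)
a <- rnorm(3)
set.seed(10)     # resetting the seed replays the same sequence
b <- rnorm(3)
identical(a, b)  # TRUE
```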
generating numbers from linear model:
Example:
# Suppose y = a + b * x + e
# a: 0.5
# b: 2
# e ~ N(0, 2^2)
# x ~ N(0, 1^2)
set.seed(20)
x <- rnorm(100)
e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e
summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.4080 -1.5400 0.6789 0.6893 2.9300 6.5050
# graph plots, we can see the line
plot(x, y)
generating numbers from generalized linear model:
generalized linear model means distribution is not Normal (for instance: Poisson).
Example:
# Y ~ Poisson(mu)
# log mu = a + bx
# a = 0.5
# b = 0.3
set.seed(1)
x <- rnorm(100)
log.mu <- 0.5 + 0.3 * x
y <- rpois(100, exp(log.mu))
summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 1.00 1.55 2.00 6.00
plot(x, y)
generating samples:
sample() draws random samples from a vector
no repetition by default (permutation)
repetition can be set via arg replace=TRUE
example:
> set.seed(3)
> sample(1:10, 10)
[1] 2 8 4 3 9 6 1 5 10 7
> sample(1:10, 10, replace=TRUE)
[1] 6 6 6 6 9 9 2 8 9 3
✔ Plotting with Base Graphics (23:22) @done (14-01-23 17:50)
plotting & graphic packages:
graphics: base functions such as plot(), hist(), boxplot(), etc...
lattice: trellis functions such as xyplot(), bwplot(), etc. (independent of graphics)
grid: low-level graphic routine (seldom used)
grDevices: graphic devices (screen output / file generation) such as X11, PDF, PNG, etc...
process:
output: screen? file?
usage: for temp screen display, a presentation, a paper?
data volume: a few points vs huge data set?
resizable?: bitmap formats (JPEG, PNG) vs vector formats (PDF, PostScript)
which package?: base (simpler, built piecemeal) or grid/lattice ("single function call"); they cannot be mixed
graphics package:
example:
> x <- rnorm(100)
# draw histogram
> hist(x)
# close window
> dev.off()
null device
1
# draw histogram (opens new window)
> hist(x)
# draw another histogram (reuses window)
> hist(2 * x)
graphic parameters:
see par() function for setting graphic parameters for current session only
some params can be overriden in specific plotting functions
important params:
pch: plotting symbol (default: open circle)
lty: line type (default: solid line)
lwd: line width
col: plotting color (default: black)
las: orientation of axis labels
bg: background color
mar: margin sizes (c(bottom, left, top, right))
oma: outer margin sizes
mfrow: number of plots per row, column (plots filled row-wise)
mfcol: idem (plots filled column-wise)
functions:
plot: draws a plot (scatterplot usually, other types of plots depending on input object)
hist: draws a histogram
lines: adds lines to a plot
points: adds points to a plot
text: adds text
title: adds titles (axis, title, sub-title, margin)
mtext: adds text to margins
axis: add axis tick marks or labels
legend: adds legend
devices:
?Devices lists the available devices
vector formats (resizable):
pdf (resizes well, portable)
postscript (older, less used)
bitmap formats (+: well-suited for plots with many points, -: do not resize well):
png :good for solid color, lossless compression
jpeg: good for pictures, lossy compression
bitmap: less used
bmp: native Windows format, less used
example:
# generate plot in PDF file:
> pdf(file = "testRplot.pdf")
> x <- rnorm(100)
> hist(x)
> dev.off()
copying device:
to export to file, either: 1. open device, make plot, close device
2. make plot on default device (screen), then copy it to other device
copying device is not an exact operation!
functions: dev.copy(), dev.copy2pdf(), dev.list(), dev.set(), dev.off()
✔ Base Graphics Plotting Demo (16:56) @done (14-01-23 18:18)
# launch plot demo, useful to show params (plotting symbols, etc):
example(points)
# trick use different plotting symbols on the same plot:
plot(x, y, type = "n") # draws axes, legends, etc... but not the points
points(x[group == 'Male'], y[group == 'Male'], col = "blue") # draw "Male" points in blue via factor "group"
points(x[group == 'Female'], y[group == 'Female'], col = "pink") # draw "Female" points in pink via factor "group"
✔ Plotting with Lattice Graphics (7:18) @done (14-01-23 18:57)
Main Lattice functions:
xyplot() for scatterplots
bwplot() for boxplots
histogram() for histograms
stripplot() for boxplots with points
dotplot()
splom() for scatterplot matrices (directly on a dataset)
levelplot()/contourplot() for image data
example:
library(lattice)
library(nlme)
xyplot(distance ~ age | Subject, data = Orthodont)
functions do not directly print:
the returned object is of type "trellis" (it can be stored, but it's better to save the code)
a print() method call is needed to draw on a device (auto-printed in the console)
arguments:
first arg: formula like "y ~ x | f * g" (x, y: inputs ; f, g: optional factors)
"data" arg: data
"panel" (optional): for extra items per group (ex: draw average line via panel.abline() or regression line via panel.lmline())
✔ Lattice Graphics Plotting Demo (21:23) @done (14-01-23 19:35)
# display documentation
package ? lattice
# list the package's functions:
library(help = lattice)
# split a continuous variable into 4 ranges that overlap slightly
# useful to see the influence of a third variable on the relation between two others
temp.cut <- equal.count(environmental$temperature, 4)
wind.cut <- equal.count(environmental$wind, 4)
xyplot(
# 2 4-level factors combined (wind and temp) => 16 panels
ozone ~ radiation | temp.cut * wind.cut,
data = environmental, as.table = TRUE,
# add regression line
panel = function (x, y, ...) {
# draw plots
panel.xyplot(x, y, ...)
# add regression line
fit <- lm(y ~ x)
panel.abline(fit)
# add a smoother (loess)
panel.loess(x, y)
})
✔ Plotting with ggplot2: Part 1 (24:18) @done (14-01-27 09:34)
ggplot2 = CRAN package that implements "Grammar of Graphics" (book)
= "3rd graphics system" (after base and lattice)
concept: verbs, nouns & adjectives
workflow:
1. start with base function (ie plot)
2. use annotation functions (text, lines, points, axis) to add/modify
many automatic stuff (but customisation is possible)
basic function, qplot():
similar to base's plot()
source: a data frame (always), from input (preferred) or the workspace
output: aesthetics (size, shape, color) + geoms (points, lines)
qplot(): simple function (hides complexity)
ggplot(): advanced function (more powerful and flexible)
examples:
install.packages("ggplot2")
library(ggplot2)
# draw plots and show sub-groups (based on 'drv' factor variable):
qplot(displ # x coord
,hwy # y coord
,data = mpg # data frame
,color = drv # change aesthetics: color points via 'drv' factor
# legend is automatically added
)
# draw plots and smooth line ("loess"):
qplot(displ # x coord
,hwy # y coord
,data = mpg # data frame
,geom = c("point", "smooth") # add geom: smooth line + 95% confidence interval
)
# draw histogram:
qplot(hwy # only 1 variable
, data = mpg
, fill = drv)
# draw facets (i.e. groups, like panels for lattice):
qplot(displ, hwy, data = mpg
, facets = . ~ drv # pattern is "row var" ~ "col var", "." means empty
# here, drv has 3 levels => 3 facets
)
✔ Plotting with ggplot2: Part 2 (28:35) @done (14-01-27 17:26)
advanced function, ggplot:
example:
# prepare a plot on 2 variables:
> g <- ggplot(mpg, aes(displ, hwy))
# check:
summary(g)
data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl,
class [234x11]
mapping: x = displ, y = hwy
faceting: facet_null()
# auto save and print:
p <- g + geom_point() # geom_point is a layer to draw points
> print(p)
# print only:
g + geom_point()
# print (plots) with additional smooth line:
g + geom_point() + geom_smooth()
# print (plots) with additional smooth line w/ linear model (removes noise):
g + geom_point() + geom_smooth( method = "lm" )
# print (plots) with facets:
g + geom_point() + facet_grid(. ~ cyl) + geom_smooth( method = "lm" )
annotation:
labels: xlab(), ylab(), labs(), ggtitle()
addit. drawings: geom_*() like geom_smooth, etc...
global vars: theme() (example: theme(legend.position = "none"))
appearance: theme_gray() for gray background (default), theme_bw() is a black and white UI theme
modify aesthetics:
# set point size + constant point color + alpha transparency:
g + geom_point( color = "steelblue", size = 4, alpha = 1/2)
# set point size + dynamic point color:
g + geom_point( aes(color = drv) , size = 4 )
# print (plots) with custom labels:
g + geom_point() + labs(title="my title") + labs(x = "custom x axis") + labs( y = expression ("custom y axis for " * hwy) )
# axis limit, filter outlier values:
g + geom_point() + ylim(-3, 3)
# axis limits, set outlier values out of scale (by default, all values are displayed):
g + geom_point() + coord_cartesian( ylim = c(-3, 3) )
Make categories over continuous values:
If the conditioning variable is continuous (many values, not just 3 or 4), cut it into ranges
example:
quantile(mpg$cty, seq(0, 1, length=4), na.rm = TRUE)
0% 33.33333% 66.66667% 100%
9 15 18 35
cutpoints <- quantile(mpg$cty, seq(0, 1, length=4), na.rm = TRUE)
mpg$cty_range <- cut(mpg$cty, cutpoints)
levels(mpg$cty_range)
[1] "(9,15]" "(15,18]" "(18,35]"
✔ Plotting with Mathematical Annotation (6:03) @done (14-01-28 16:06)
Math symbols can be set in labels, plots, etc... (LaTeX-like symbols)
Use expression() function to write math symbols
cf ?plotmath
examples:
# create base plot with title 'theta = 1':
plot(1, 2, main = expression(theta == 1))
# create base plot with title = sum of xi * yi (a sigma symbol is used):
plot(1, 2, main = expression(sum(x[i] * y[i], i==1, n)))
# expression with dynamic value (i.e. from variable):
v <- -2
# Will set the title to "x-bar = -2" (bar(x) renders as x with a bar on top)
plot(1, 2, main = substitute(
bar(x) == val, # the expression
list(val=v) )) # list of variables to be substituted
✔ Week 4 @done (14-01-29 10:49)
✔ Plotting and Color in R (22:06) @done (14-01-28 17:08)
default colors (1: black and white, 2: red, 3: green, etc...) are not very pretty!
grDevices package has 2 functions:
colorRamp(colors) takes colors as input and returns a function that takes a number from 0 to 1 and returns RGB values
colorRampPalette(colors) takes colors as input and returns a function that takes a number n and returns n colors (as a character vector of hex RGB values)
They blend colors (interpolation)
colorRamp() examples:
pal <- colorRamp(c("red", "blue"))
# RGB color for red:
pal(0)
[,1] [,2] [,3]
[1,] 255 0 0
# RGB color for blue:
pal(1)
[,1] [,2] [,3]
[1,] 0 0 255
# RGB color in between:
pal(0.5)
[,1] [,2] [,3]
[1,] 127.5 0 127.5
colorRampPalette examples:
pal <- colorRampPalette(c("red", "blue"))
# return 2 colors (red and blue):
> pal(2)
[1] "#FF0000" "#0000FF"
# return 10 colors from red to blue:
> pal(10)
[1] "#FF0000" "#E2001C" "#C60038" "#AA0055" "#8D0071" "#71008D" "#5500AA"
[8] "#3800C6" "#1C00E2" "#0000FF"
colors() lists color names (instead of RGB values)
gray() is the equivalent of colorRamp() for black and white
usage:
pal <- colorRampPalette(c("red", "yellow", "blue"))
x <- rnorm(100)
# plot 100 points, first will be red, 50th will be yellow, 100th will be blue, and other will be in between:
plot(x, col = pal(100))
RColorBrewer package:
3 types of palettes
sequential, for continuous order data (ex: "Blues" that goes from light blue to dark blue)
diverging (example: positive vs negative values) => from dark color 1 to light (in the middle) to dark color 2 (ex: "Spectral")
qualitative, for data that are not ordered (categorical data) => each color is very different from the previous one (ex: "Set1")
Main function: brewer.pal(numberOfColors, paletteName)
Can be used with colorRamp() and colorRampPalette()
Example:
# draw 100 plots using plot colors from light blue to dark blue via "Blues" palette:
library(RColorBrewer)
# use 3 colors (we don't need more) from the "Blues" palette
cols <- brewer.pal(3, "Blues")
pal <- colorRampPalette(cols)
# generate 100 colors
plot(x, col = pal(100))
NB: smoothScatter(), for plotting a huge number of points, uses the RColorBrewer package
Additional notes:
rgb() returns RGB colors and handles transparency ('alpha' param)
Can be useful for overlapping plots:
x <- rnorm(10000)
plot(x, col = rgb(0, 0, 0, 0.1))
✔ Dates and Times (10:29) @done (14-01-28 17:53)
date:
a day in year (no time)
via Date class (number of days since 1970/01/01)
time:
date + time + timezone
POSIXct and POSIXlt classes (number of seconds since 1970/01/01)
POSIXct uses a big integer
POSIXlt is a list with additional info such as day of the week, day of the year etc... (year, month, yday, hour, min, sec, etc...)
# date from string:
d <- as.Date("2013-12-25")
# number of days since 1970:
daysSinceNow <- unclass(d)
[1] 16064
# generic functions for dates and times (output shown for a French locale):
weekdays(d)
[1] "mercredi"
months(d)
[1] "décembre"
quarters(d)
[1] "Q4"
# system time:
t <- Sys.time()
dt <- as.POSIXlt(t)
d$sec
Error in d$sec : $ operator is invalid for atomic vectors
dt$sec
[1] 51.66967
data/time format:
strptime() converts one or more strings to time objects (see help for formats)
date/time operations:
via standard functions: +, -, ==, >, etc...
+ conversion functions: as.Date, as.POSIXct, as.POSIXlt
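Sketch of date arithmetic with the standard operators (the dates are invented):

```r
d1 <- as.Date("2013-12-25")
d2 <- as.Date("2014-01-01")
d2 - d1               # Time difference of 7 days
d2 > d1               # TRUE
as.numeric(d2 - d1)   # 7
```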
✔ Regular Expressions (27:21) @done (14-01-28 18:39)
regexp = combination of literals (text) and meta-characters (ex: starts with, alternative, word boundary)
In R, a way to extract data from "unfriendly" sources (web sites, messy text files, etc...).
meta-characters:
^: line that starts with text (ex: "^foo")
$: line that ends with text (ex: "foo$")
[]: a set of characters (ex: [Nn][Ii][Cc][Oo] to find 'nico', ignoring case)
can be used with ranges (ex: [0-9] or [a-z])
[^]: negated set (ex: "[^?.]$" matches lines that do NOT end with '?' or '.')
.: any single character
|: alternatives (ex: "dev|coder" will return lines containing either "dev" or "coder")
(): indicates scope (ex: "^foo|bar" matches lines that start with "foo" or contain "bar",
vs "^(foo|bar)" which matches lines that start with "foo" or "bar")
also stores the matched text ("grouping") in \1, \2, etc...
?: for optional expressions (ex: "George( [Ww]\. )? Bush" matches "George W. Bush" and "George Bush")
\: escape meta-character (ex: "\." means period, not "any character" meta-char)
+: at least one of...
*: any of... (including none)
"greedy" => matches the longest possible string
greediness can be stopped via ?
{}: custom repetition: {min,max} interval, {n} exactly, or {min,} at least
(ex: "[Bb]ush( +[^ ]+ +){1,5} debate" matches lines having 1 to 5 words between "Bush" and "debate")
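The meta-characters above can be exercised with grepl() (the test strings are invented):

```r
x <- c("New York", "Newark?", "Vermont.")
grepl("^New", x)       # TRUE TRUE FALSE  (starts with "New")
grepl("[^?.]$", x)     # TRUE FALSE FALSE (does not end with '?' or '.')
grepl("York|mont", x)  # TRUE FALSE TRUE  (alternatives)
```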
✔ Introduction to Baltimore City Homicide Data (4:20) @done (14-01-28 18:44)
✔ Regular Expressions in R (30:08) @done (14-01-28 20:26)
grep()/grepl() search in a character vector; return matching index numbers (grep) or a logical vector (grepl)
regexpr()/gregexpr() are similar but return the index in the string where the match begins, plus the match length
regexpr() reports the first match only, gregexpr() all matches
used in conjunction with regmatches()
sub()/gsub() search and replace (sub for first match, gsub for all matches)
regexec() gives indices for sub-expressions (with parentheses)
demo:
# Count lines with "shooting":
length(grep("[Ss]hooting", homicides))
[1] 1005
# Count lines with "cause: shooting":
length(grep("[Cc]ause: [Ss]hooting", homicides))
[1] 1003
# Troubleshooting differences:
s1 <- length(grep("[Ss]hooting", homicides))
s2 <- length(grep("[Cc]ause: [Ss]hooting", homicides))
		setdiff(s1, s2) # wrong: compares the two counts, not the matching indices
[1] 1005
s1 <- grep("[Ss]hooting", homicides)
s2 <- grep("[Cc]ause: [Ss]hooting", homicides)
setdiff(s1, s2)
[1] 318 859
setdiff(s2, s1)
integer(0)
		# states that start with "New":
grep("^New", state.name) # line numbers
[1] 29 30 31 32
grep("^New", state.name, value=TRUE) # values
[1] "New Hampshire" "New Jersey" "New Mexico" "New York"
grepl("^New", state.name) # boolean vector
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE
# Extract dates (initial try):
> r <- regexpr("<dd>Found(.*?)</dd>", homicides[1:5])
> regmatches(homicides[1:5], r)
[1] "<dd>Found on January 1, 2007</dd>" "<dd>Found on January 2, 2007</dd>"
[3] "<dd>Found on January 2, 2007</dd>" "<dd>Found on January 3, 2007</dd>"
[5] "<dd>Found on January 5, 2007</dd>"
		# Extract dates (second try):
> r <- regexpr("<dd>Found(.*?)</dd>", homicides[1:5])
> m <- regmatches(homicides[1:5], r)
> d <- gsub("<dd>Found on |</dd>", "", m)
> as.Date(d[1], "%B %d, %Y")
[1] NA
		# NB: %B matches month names in the current locale; works in a US/English locale (here: French)
> as.Date(d, "%B %d, %Y")
[1] NA NA NA NA NA
		# Using sub-expressions via regexec():
r <- regexec("<dd>Found on (.*?)</dd>", homicides)
m <- regmatches(homicides, r)
dates <- sapply(m, function(x) x[2])
dates[1]
[1] "January 1, 2007"
dates <- as.Date(dates, "%B %d, %Y")
# hist() can automatically handle dates
hist(dates, "month", freq=TRUE)
✔ Classes and Methods in R (34:51) @done (14-01-29 10:49)
	S (and R!) supports OOP, with a few specifics of its own
2 systems:
- S3 classes and methods: informal, "old-style", easier, quick-and-dirty
- S4 classes and methods: more formal and rigorous, "new-style"
Separate systems, but can partially mix.
	S4 style is in the "methods" package (usually loaded by default).
Class definition: setClass()
Object creation: new()
	Method: a function that only operates on a certain class of objects
	Generic function: a function that only dispatches to methods (ex: plot() dispatches to different functions depending on the input data type)
help: ?Classes, ?Methods, ?setClass, ?setMethod, ?setGeneric
Generic functions:
S3:
# display "mean" function signature:
mean
function (x, ...)
UseMethod("mean")
<bytecode: 0x065e11dc>
<environment: namespace:base>
# dispatched methods:
methods("mean")
[1] mean.Date mean.default mean.difftime mean.POSIXct mean.POSIXlt
# display code for default method:
getS3method("mean", "default")
function (x, trim = 0, na.rm = FALSE, ...)
{
if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
# ...
S4:
# display "show" function signature:
> show
standardGeneric for "show" defined from package "methods"
function (object)
standardGeneric("show")
<bytecode: 0x052674c4>
<environment: 0x04f907a0>
Methods may be defined for arguments: object
Use showMethods("show") for currently available ones.
(This generic function excludes non-simple inheritance; see ?setIs)
# display "show" dispatch methods:
> head(showMethods("show"))
Function: show (package methods)
object="ANY"
object="classGeneratorFunction"
object="classRepresentation"
	A generic function has at least one parameter (the object).
	If a matching method is found, it is called. Otherwise the default method is called if it exists; else an error is thrown.
	S3 dispatched methods should not be called directly!
	S4 dispatched methods cannot be called directly.
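The dispatch rules above can be sketched with a minimal S3 example ("myclass" and its print method are made-up names for illustration):

```r
# An S3 "class" is just an attribute on an ordinary object:
obj <- structure(list(value = 42), class = "myclass")

# An S3 method is registered by naming convention: generic.class
print.myclass <- function(x, ...) {
    cat("myclass object, value:", x$value, "\n")
    invisible(x)
}

print(obj)    # dispatches to print.myclass()
unclass(obj)  # stripping the class falls back to print.default()
```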
Write your own type:
	Why? To represent a custom model (e.g. gene expression data) that has no built-in type
Probably need to write methods for print()/show(), summary(), plot()
For new S4 type:
1. use setClass() to define:
- name of class
- data elements, "slots"
2. use setMethod() to define:
- methods
3. check class info:
via showClass()
Example:
# define polygon:
setClass("polygon", representation(x = "numeric", y = "numeric"))
# implement plot() for polygon, and register it (side-effect) for current session:
setMethod(
"plot", # generic function
"polygon", # class name
function (x, y, ...) {
			# set up the plot region without drawing anything (type="n")
plot(x@x, x@y, type="n", ...)
# draw lines
xp <- c(x@x, x@x[1])
yp <- c(x@y, x@y[1])
lines(xp, yp)
})
	# check that it has been registered:
showMethods("plot")
Function: plot (package graphics)
x="ANY"
x="polygon"
# create polygon object and call plot():
p <- new("polygon", x = c(1,2,3,4), y = c(1,2,3,1))
plot(p)
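As noted earlier, a new type usually also wants a print()/show() method; a minimal sketch for S4, re-defining the polygon class from the example so the snippet stands alone:

```r
library(methods)  # S4 machinery (usually loaded by default)

setClass("polygon", representation(x = "numeric", y = "numeric"))

# show() is the S4 analogue of print(); called on auto-printing
setMethod("show", "polygon", function(object) {
    cat("polygon with", length(object@x), "vertices\n")
})

p <- new("polygon", x = c(1, 2, 3, 4), y = c(1, 2, 3, 1))
p   # auto-printing dispatches to show()
```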