Created
January 31, 2014 10:05
Nicolas Kosinski's notes from Coursera course "Computing for Data Analysis" that teaches R basics for statistics: https://www.coursera.org/course/compdata
Nicolas Kosinski's notes from Coursera course "Computing for Data Analysis" that teaches R basics for statistics: https://www.coursera.org/course/compdata (session #4: https://class.coursera.org/compdata-004) | |
✔ R installation @done (14-01-21 10:59) | |
✔ Week 1 @done (14-01-22 07:55) | |
✔ What Makes R Different? (4:20) @done (14-01-21 12:08) | |
R mixes: | |
interactive (command-oriented) tool | |
programming lang | |
✔ How to Get Help (13:53) @done (14-01-21 12:37) | |
✔ Background and Overview (16:38) @done (14-01-21 15:39) | |
background: | |
dialect of S (interactive language) | |
S created in 1976, several versions | |
R created in 1991, current version: 3, S-like syntax | |
pros: | |
free (GPL) | |
quite lean and modular | |
excellent graphical capabilities
active community | |
cons: | |
old technology (40 years old!)
little support on 3D graphics | |
limitation due to memory storage | |
overview: | |
"base" packages (utils, stats, datasets etc...) | |
"recommended" packages (boot, class, etc...) | |
source: CRAN (official) / Bioconductor / personal websites... | |
✔ Data Types (31:06) @done (14-01-21 17:27) | |
5 base types: character, numeric, integer, logical and complex | |
vector: contains objects of same type | |
list: contains objects of any types | |
objects have attributes | |
assignment: <- | |
comment: # | |
range: 1:10
special values: NA (missing value), NaN (undefined numeric result; a NaN is also an NA, but an NA is not always a NaN), Inf (infinity)
explicit coercion: "as.logical(0)" | |
vector: | |
created by | |
vector(type, length) | |
or | |
c(...) # concatenation, objects are converted to compatible type (i.e. c(1.2, "a") creates string vector) | |
matrix: | |
special kind of vector with dimension | |
matrix(1:6, nrow=2, ncol=3) # values are filled column by column
dimension can be set after value assignment
m <- c(1:10)
dim(m) <- c(2, 5)
create via column or row binding
cbind(1:3, 4:6) # 3x2 matrix, rows: (1, 4), (2, 5), (3, 6)
rbind(1:3, 4:6) # 2x3 matrix, rows: (1, 2, 3), (4, 5, 6)
list: | |
are indexed | |
factor:
special kind of vector (integer codes + a "levels" attribute)
used for categories (enums) => index + label + count
treated specially by modeling functions lm() and glm()
counts occurrences (kind of map):
f <- factor(c("yes", "no", "yes"))
table(f) # no yes
# 1 2
levels(f) # "no" "yes"
treated as integers:
unclass(f) # 2, 1, 2 (plus attribute "levels": "no" "yes")
order can be chosen via constructor: factor(c("yes", "no", "yes"), levels=c("yes", "no"))
missing values: | |
is.na(f) | |
is.nan(f) | |
data frames: | |
for tabular data: special list whose elements (columns) can have different types but the same length + row names
can be converted to matrix (=> coercion) | |
naming: | |
Objects can have names (meta-data)
# for vectors
x <- 1:3
names(x) <- c("one", "two", "three")
# for matrices (row names first, then column names)
m <- matrix(1:4, nrow=2, ncol=2)
dimnames(m) <- list(c("row1", "row2"), c("col1", "col2"))
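The construction rules above (column-wise filling, dim assignment, binding, coercion in c(), factor codes) can be checked with a minimal sketch. The variable names m1, m2, m3 and f are only illustrative:

```r
# Three ways to build the same 2x3 integer matrix
m1 <- matrix(1:6, nrow = 2, ncol = 3)       # filled column by column
m2 <- 1:6
dim(m2) <- c(2, 3)                          # dimensions set after assignment
m3 <- rbind(c(1L, 3L, 5L), c(2L, 4L, 6L))   # row binding

identical(m1, m2)    # TRUE
all(m1 == m3)        # TRUE

# c() coerces to the most general type: numeric + character => character
class(c(1.2, "a"))   # "character"

# a factor stores integer codes; level order follows the 'levels' argument
f <- factor(c("yes", "no", "yes"), levels = c("yes", "no"))
as.integer(f)        # 1 2 1
```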
✔ Subsetting (17:20) @done (14-01-21 18:18) | |
[*index or condition*] returns a subset with the same type as the original (can select several elements)
[[]] extracts a single element from a multi-type container (list, data frame)
$ extracts an element by name (like [[]], but the name is written literally and is partially matched)
subsetting vectors: | |
v <- c("a", "b", "c", "c") | |
# by index: | |
v[1] # "a" | |
# filtering: | |
v[v > "a"] # "b", "c", "c" | |
subsetting matrices: | |
m <- matrix(1:6, 2, 3) | |
# by indices, into single-element vector: | |
m[1, 2] # 3 | |
# by indices with missing values: | |
m[1,] # 1 3 5 | |
# one element as a 1x1 matrix:
m[1, 2, drop = FALSE] # 1x1 matrix containing 3
subsetting lists: | |
l <- list(ints = c(1, 2, 3), doubles = c(1.1, 2.2, 3.3)) | |
# by name: | |
l$ints # 1, 2, 3 | |
l[["ints"]] # 1, 2, 3
# by name via variable: | |
var <- "ints" | |
l[[var]] # 1, 2, 3 | |
# by name via partial matching ("i" is a prefix of "ints"; `in` itself is a reserved word):
l$i # 1, 2, 3
l[["i", exact = FALSE]] # 1, 2, 3
l[["i"]] # NULL
# by index: | |
l[[c(1, 3)]] # 3 | |
filtering NA values: | |
via [cond] or complete.cases() (vectors and data frames) | |
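A minimal sketch of both ways of filtering missing values. The vector x and the data frame df are made-up examples:

```r
# vector: keep non-missing elements via a logical condition
x <- c(1, NA, 3, NA, 5)
x[!is.na(x)]                  # 1 3 5

# data frame: keep rows that have no missing value in any column
df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
good <- complete.cases(df)    # TRUE FALSE FALSE
df[good, ]                    # only the first row remains
```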
✔ Vectorized Operations (3:46) @done (14-01-21 18:46) | |
many functions can be applied in parallel on objects (vectors, etc...) without looping | |
v1 <- 1:4 | |
v2 <- 6:9 | |
v3 <- v1 + v2 # 7 9 11 13 | |
v3 < 10 # T, T, F, F | |
matrix product: | |
m1 <- matrix(1:4, 2, 2) | |
m2 <- matrix(rep(10, 4), 2, 2) | |
m1 %*% m2 # | 40 40 | | 1 3 | | 10 10 | | |
# | 60 60 | = | 2 4 | . | 10 10 | | |
# * would mean multiply element-by-element | |
✔ Reading/Writing Data: Part 1 @done (14-01-21 20:12) | |
read/write functions for CSV-like text files: | |
read.table()/write.table() # default separator is whitespace, default comment character is "#", can have a header
read.csv() # similar to read.table but default separator is "," and header = TRUE by default
readLines()/writeLines() | |
dget()/dput() # read/write single object | |
source()/dump() # read/write multiple objects | |
load()/save() # read/write in workspace | |
serialize()/unserialize() # read/write binary inputs | |
Reading large dataset: | |
refer to help page (hints) | |
disable comment scanning if the file has no comments (comment.char="")
help type detection via "colClasses" argument:
# hard-code column types ("data.txt" is a placeholder file name):
all <- read.table("data.txt", colClasses="numeric")
# detect types from the first n rows:
start <- read.table("data.txt", nrows=10)
classes <- sapply(start, class)
all <- read.table("data.txt", colClasses=classes)
set row number (exact or slight overestimate), n being the expected number of rows:
all <- read.table("data.txt", nrows=n)
estimate required memory: | |
number of cells * 8 (bytes/numeric) = estimated memory bytes (optimistic) | |
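A worked instance of this rule of thumb. The row and column counts are hypothetical, and the result ignores R's own overhead during reading:

```r
n_rows <- 1500000   # hypothetical number of rows
n_cols <- 120       # hypothetical number of numeric columns

bytes <- n_rows * n_cols * 8   # 8 bytes per numeric cell
gib   <- bytes / 2^30          # bytes -> GiB

bytes   # 1.44e+09
gib     # about 1.34 GiB
```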
✔ Reading/Writing Data: Part 2 (9:25) @done (14-01-21 20:18) | |
Text file with metadata: | |
pros: human-readable, Unix philosophy, versioning... | |
cons: greedy (not storage efficient) | |
basic functions: | |
d <- data.frame(a=1, b="a") | |
d2 <- data.frame(a=2, b="c") | |
# serialize object: | |
dput(d) | |
# write object to a file: | |
dput(d, file = "d.R") | |
# read file to object: | |
d2 <- dget("d.R")
# write objects by name to a file: | |
dump(c("d", "d2"), file="d_and_d2.R") | |
# read objects from a file: | |
rm(d, d2) # we can remove objects to be sure... | |
source("d_and_d2.R") | |
file reading via connection interfaces: | |
file | |
url | |
gzfile # open GZIP file | |
bzfile # open BZIP2 file | |
some functions such as read.csv(filename) hide the connection interface (i.e. no need to open and close the file)
connection interface may be used for partial reading (example: readLines(cnx, 10)) | |
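A minimal sketch of partial reading through an explicit connection. It writes a throwaway GZIP file first so that it is self-contained; the file content is made up:

```r
# create a small compressed text file to read from
path <- tempfile(fileext = ".gz")
con <- gzfile(path, "w")
writeLines(paste("line", 1:100), con)
close(con)

# open the connection, read only the first 10 lines, then close it
con <- gzfile(path, "r")
first <- readLines(con, n = 10)
close(con)

length(first)   # 10
first[1]        # "line 1"
```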
✔ Setting Your Working Directory and Editing R Code (Windows) [7:20] @done (14-01-22 07:44) | |
Basic commands related to working directory: | |
# work dir can be changed via menu "File | Change dir" | |
getwd() # show working directory, used as a base path for loading files | |
dir() # list files from work dir | |
source("foo.R") # load code from script file | |
ls() # lists functions and variables | |
dev loop : edit script, save it and load it in R Console via source("filename") | |
✔ The str function (6:05) @done (14-01-22 07:55) | |
str() displays the internal structure of an object | |
"what's in this object?" | |
many usages such as: | |
summarizing nested objects (vectors, lists, etc...) | |
displaying function signature | |
✔ Week 2 @done (14-01-23 08:55) | |
✔ Control Structures (15:23) @done (14-01-22 08:42) | |
control-flow keywords (for scripts):
if/else | |
for | |
while | |
repeat # infinite loop (until break is called) | |
break # break loop | |
next # go to next loop iteration | |
return | |
NB: for interactive usage, *apply functions are more useful | |
conditional assignment: | |
y <- if (x > 2) { 1 } else { 0 } | |
for (details): | |
# on range: | |
for (i in 1:4) {} # beware not to overwrite an existing variable!
# on vector indexes: | |
for (i in seq_along(v)) { v[i] } | |
# on vector values: | |
for (elem in v) { elem } | |
# on matrix: | |
m <- matrix(...) | |
for (i in seq_len(nrow(m))) { for (j in seq_len(ncol(m))) { m[i, j] } } # nested loops
combining expressions: | |
evaluated left to right | |
via logical operators (&&, ...) | |
✔ Functions (16:32) @done (14-01-22 09:50)
first-class objects: | |
arguments can be functions | |
functions can be nested | |
definition: | |
# creates an object of class "function" | |
f <- function(foo, bar, bazbaz = "default") { } # body goes between the braces
returned value is last expression | |
arguments: | |
are named | |
can have default value | |
f <- function(nums, best.effort = TRUE) { }
arguments declared in the function definition are the "formal arguments"
formals(f) returns formal arguments | |
matching (on caller side): | |
# 1. by name: | |
f(foo=1, bar=2, bazbaz=3) | |
# 2. by name with partial matching (for interactive usage, preferably): | |
f(fo=1, bar=2, baz=3) | |
# 3. by position: | |
f(1, 2, 3) | |
# mixing position and name: | |
# legal but beware (named args are matched first and do not count for position)
f(bazbaz=3, 1, 2) | |
are evaluated lazily: | |
=> missing arg error occurs when arg is evaluated | |
variable arguments (...):
usages
extending an existing function
function (x, ...) { otherFunction(...) }
generic functions with extra arguments => function dispatch
unknown number of args
... may come first in the signature (example: paste(..., sep = " "))
args that come after ... must be matched by exact name (no positional or partial matching)
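A minimal sketch of lazy evaluation and of the matching rule for arguments placed after "...". The functions g and shout are hypothetical helpers, not part of the course material:

```r
# lazy evaluation: 'b' is never used, so omitting it raises no error
g <- function(a, b) a^2
g(2)                          # 4

# "..." is forwarded untouched to paste(); 'sep' comes after "..."
shout <- function(..., sep = " ") {
  toupper(paste(..., sep = sep))
}
shout("a", "b")               # "A B"
shout("a", "b", sep = "-")    # "A-B"
# 'se' is NOT partially matched to 'sep': it falls into "..." as one more string
shout("a", "b", se = "-")     # "A B -"
```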
✔ Scoping Rules (19:03) @done (14-01-22 11:26) | |
Bind symbol to value: | |
Via several environments: | |
1. global env # always first | |
2. package X # libraries loaded by user via library() are inserted here, by default | |
3. package Y | |
... | |
last is Base package # always last | |
Environments are ordered: search() | |
Function names != object names: | |
object "foo" can exist if function "foo" exists | |
Lexical (=static) scoping: | |
differs from dynamic scoping | |
environment: | |
symbol dictionary (maps symbols to values)
parent | |
children | |
foo <- function() {} | |
environment(foo) # outputs global env, for instance
closure: | |
closure = function + environment | |
free variable: | |
used in a function but neither a formal argument nor a local variable; looked up in the environment where the function was defined
foo <- function(a) {
# b is a free variable | |
a + b | |
} | |
free vars are searched in the function's environment (let's call it e), then in e's parent, etc... up to the global env, then down the search list until the empty environment
if not found => error | |
nested functions: | |
functions can be returned by other functions | |
they have a dedicated environment | |
example: | |
# declare and call nested function: | |
make.power <- function(n) { | |
pow <- function(x) { | |
x ^ n | |
} | |
pow | |
} | |
cube <- make.power(3) | |
cube(4) | |
# display content of cube's env: | |
ls(environment(cube)) | |
[1] "n" "pow" | |
# display symbol bound to "n", for cube(): | |
get("n", environment(cube)) | |
[1] 3 | |
Lexical scoping: | |
Free variables are searched in function definition's env | |
Example: | |
y <- 10 | |
f <- function(x) { | |
y <- 2 | |
y^2 + g(x) | |
} | |
g <- function(x) { | |
# y is bound to 10 (scope of function definition)
# and not to 2 (scope of function call) | |
x * y | |
} | |
f(3) | |
[1] 34 | |
Lexical scoping is also used in languages such as Python, Perl, Scheme, Common Lisp | |
Consequence: memory cost (to store all environments) | |
✔ Optimization Application (9:21) @done (14-01-22 12:01) | |
Lexical scoping is well suited to optimization problems
Optimization routines: optim(), nlm(), optimize(), etc...
The objective function is built by a "constructor" function that returns a nested function (a closure holding the data and any fixed parameters)
By default, these routines minimize the objective function (they do not maximize it)
# Estimate both mu (mean) and sigma (standard deviation)
nLL <- make.NegLogLike(normals)
optim(c(mu = 0, sigma = 1), nLL)$par
# Fix sigma to 2, estimate mu
nLL <- make.NegLogLike(normals, c(FALSE, 2))
optimize(nLL, c(-1, 3))$minimum
# Fix mu to 1, estimate sigma
nLL <- make.NegLogLike(normals, c(1, FALSE))
optimize(nLL, c(-1, 3))$minimum
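The notes call make.NegLogLike() and use "normals" without defining them. Below is a minimal sketch of what such a constructor could look like, assuming a Normal model; the body, the simulated data and the search interval for sigma are assumptions, not the course's exact code:

```r
# constructor: returns a closure that remembers 'data' and 'fixed'
make.NegLogLike <- function(data, fixed = c(FALSE, FALSE)) {
  params <- fixed
  function(p) {
    params[!fixed] <- p            # fill in the parameters being optimized
    mu <- params[1]
    sigma <- params[2]
    # negative log-likelihood of a Normal(mu, sigma^2) sample
    0.5 * length(data) * log(2 * pi * sigma^2) +
      0.5 * sum((data - mu)^2) / sigma^2
  }
}

set.seed(1)
normals <- rnorm(100, mean = 1, sd = 2)   # simulated data (assumption)

# estimate both parameters
est <- optim(c(mu = 0, sigma = 1), make.NegLogLike(normals))$par

# fix sigma to 2, estimate mu
mu.hat <- optimize(make.NegLogLike(normals, c(FALSE, 2)), c(-1, 3))$minimum

# fix mu to 1, estimate sigma (positive interval chosen here to stay away from sigma = 0)
sigma.hat <- optimize(make.NegLogLike(normals, c(1, FALSE)), c(1e-6, 10))$minimum
```

Note that a parameter cannot be fixed to 0 with this design, since 0 is indistinguishable from FALSE in the "fixed" vector.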
✔ (Loop function:) lapply (9:23) @done (14-01-22 13:18) | |
lapply: | |
lapply() applies a function to all elements of a list, returning a list (same size as input) | |
lapply(object, function, ...) | |
If object is not a list, it will be coerced to a list via as.list() (may fail)
Examples:
# mean of each list element:
x <- list(a = 1:5, b = 5:10) | |
lapply(x, mean) | |
$a | |
[1] 3 | |
$b | |
[1] 7.5 | |
# generate random numbers between 1 and 6, demonstrating extra args (...):
l <- 1:3
lapply(l, runif, min=1, max=6)
[[1]] | |
[1] 3.537174 | |
[[2]] | |
[1] 1.221711 1.470035 | |
[[3]] | |
[1] 2.304671 5.510663 4.682575 | |
# using anonymous function: | |
l <- 1:2 | |
lapply(l, function(x) x+1) | |
[[1]] | |
[1] 2 | |
[[2]] | |
[1] 3 | |
sapply is similar but the returned object is simplified when possible:
returns a vector instead of a list if every result has length 1
returns a matrix if every result is a vector of the same length (> 1)
Examples: | |
# vector returned: | |
x <- list(a = 1:5, b = 5:10) | |
sapply(x, mean) | |
a b | |
3.0 7.5 | |
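The matrix case can be illustrated with a function that returns two values per element, a minimal sketch reusing the same list:

```r
x <- list(a = 1:5, b = 5:10)

# range() returns 2 values per element => sapply() builds a 2-row matrix
r <- sapply(x, range)
r
#      a  b
# [1,] 1  5
# [2,] 5 10

# lapply() on the same input keeps a list
l <- lapply(x, range)
```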
✔ (Loop function:) apply (7:16) @done (14-01-22 14:14) | |
apply(array, dimensionToRetain, function, ...) | |
Example: | |
# Sum columns (=> retain dimension 2) of a matrix with 5 rows and 3 columns
m <- matrix(1:15, 5, 3) | |
m | |
[,1] [,2] [,3] | |
[1,] 1 6 11 | |
[2,] 2 7 12 | |
[3,] 3 8 13 | |
[4,] 4 9 14 | |
[5,] 5 10 15 | |
apply(m, 2, sum) | |
[1] 15 40 65 | |
shortcuts: | |
rowSums(x) = apply(x, 1, sum) | |
rowMeans(x) = apply(x, 1, mean) | |
colSums(x) = apply(x, 2, sum) | |
colMeans(x) = apply(x, 2, mean) | |
Optimized for large matrices | |
on array: | |
# 3-dim array (NB: 1:2*2*5 is parsed as (1:2) * 2 * 5, i.e. c(10, 20), recycled to fill the array):
a <- array(1:2*2*5, c(2, 2, 5))
# sum over dimension 3 (dimensions 1 and 2 are retained):
apply(a, c(1, 2), sum)
[,1] [,2] | |
[1,] 50 50 | |
[2,] 100 100 | |
# equivalent to:
rowSums(a, dims=2)
✔ (loop function:) tapply and split (12:22) @done (14-01-22 15:02) | |
tapply: | |
apply a function to subsets of a vector (kind of "group by")
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE) | |
X: vector | |
INDEX: factor or list of factors | |
FUN: function to apply | |
simplify: should the result be simplified? (if simplify, return a vector, otherwise a list) | |
Example: | |
# vector with
# 10 random values from a Normal distribution with mean = 0
# 10 random values from a Uniform distribution
# 10 random values from a Normal distribution with mean = 1
v <- c(rnorm(10), runif(10), rnorm(10, 1))
# generate a factor with 3 levels, each repeated 10 times
f <- gl(3, 10)
tapply(v, f, mean) | |
split: | |
Splits a vector into several vectors using factors (groups). | |
Similar to tapply, but does not apply a function. | |
Is generally used in conjunction with lapply...
Examples: | |
# split on 1 level: | |
# vector with
# 10 random values from a Normal distribution with mean = 0
# 10 random values from a Uniform distribution
# 10 random values from a Normal distribution with mean = 1
v <- c(rnorm(10), runif(10), rnorm(10, 1))
# generate a factor with 3 levels, each repeated 10 times
f <- gl(3, 10)
lapply(split(v, f), sum)
# split on multiple levels (factors): | |
x <- rnorm(10) | |
f1 <- gl(2, 5) | |
f2 <- gl(5, 2) | |
# for information, the combination of factors is | |
interaction(f1, f2) | |
# | |
str(split(x, list(f1, f2), drop=TRUE)) | |
List of 6 | |
$ 1.1: num [1:2] -0.37 -0.029 | |
$ 1.2: num [1:2] 0.00941 -0.85438 | |
$ 1.3: num -0.331 | |
$ 2.3: num -1.07 | |
$ 2.4: num [1:2] -0.165 -0.825 | |
$ 2.5: num [1:2] -0.224 -1.61 | |
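A minimal sketch of the relationship between the two functions: tapply() gives the same numbers as split() followed by sapply(). The seed value is arbitrary:

```r
set.seed(42)
v <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)                      # one factor, 3 levels, 10 values each

by.tapply <- tapply(v, f, mean)
by.split  <- sapply(split(v, f), mean)

# same group means either way
all.equal(as.vector(by.tapply), as.vector(by.split))   # TRUE
```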
✔ (loop function:) mapply (4:41) @done (14-01-22 15:13) | |
mapply() applies a function in parallel over arguments | |
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
Useful to "vectorize" a function that does not accept vector arguments.
Example: | |
# repeat 1 three times | |
> rep(1, 3) | |
[1] 1 1 1 | |
# repeat 1 four times, 2 three times, etc... | |
> mapply(rep, 1:4, 4:1) | |
[[1]] | |
[1] 1 1 1 1 | |
[[2]] | |
[1] 2 2 2 | |
[[3]] | |
[1] 3 3 | |
[[4]] | |
[1] 4 | |
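A minimal sketch of the "vectorize" use case. The helper noise() is hypothetical: calling it directly with vector arguments would not produce one sample per (n, mean) pair, whereas mapply() calls it once per pair:

```r
# hypothetical helper: n random Normal values with given mean and sd
noise <- function(n, mean, sd) rnorm(n, mean, sd)

set.seed(1)
# calls noise(1, 1, 2), noise(2, 2, 2), noise(3, 3, 2); the constant 2 is recycled
res <- mapply(noise, 1:3, 1:3, 2)

sapply(res, length)   # 1 2 3
```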
✔ Debugging Tools: Part 1 (8:50) @done (14-01-22 15:34) | |
something goes wrong! | |
Kind of messages: | |
1. message: diagnostic/notification, function execution continues
2. warning: something unexpected happened, function execution continues
example: log(-1) => NaNs produced | |
3. error: function execution stops | |
example: if (x > 0) throws an error if x is NA | |
+ condition: generic type of issue (custom) | |
invisible(v) can be used in function to indicate that object is returned but should not be auto-printed to the console | |
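A minimal sketch of how warnings and errors can be intercepted. It uses tryCatch(), which is not covered in these notes, and safe_log() is a hypothetical helper:

```r
# returns NA instead of propagating a warning or an error
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) {
      message("warning caught: ", conditionMessage(w))
      NA_real_
    },
    error = function(e) {
      message("error caught: ", conditionMessage(e))
      NA_real_
    }
  )
}

safe_log(10)    # 2.302585
safe_log(-1)    # NA (warning "NaNs produced" was caught)
safe_log("a")   # NA (error: non-numeric argument)
```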
✔ Debugging Tools: Part 2 (10:07) @done (14-01-22 15:52) | |
debugging functions: | |
traceback(): prints out call stack of last error | |
debug(f): flags a function for debug mode (breakpoint is created on first line) | |
browser(): suspends execution wherever it is called and enters debug mode (acts as a breakpoint anywhere in the function)
trace: inserts debugging code into a function (can be internal packages, not written by us) | |
recover: modify error behaviour in order to browse call stack | |
✔ Debugging Tools: Part 3 (8:23) @done (14-01-22 16:08) | |
traceback example (dumb one!): | |
> mean(z) | |
Error in mean(z) : object 'z' not found | |
> traceback() | |
1: mean(z) | |
debug example: | |
debug(myFunction) | |
myFunction("foo")
# then type 'n' + return for going to next line | |
recover example: | |
options(error = recover) | |
read.csv("doesNotExist") | |
Error in file(file, "rt") : cannot open the connection | |
In addition: Warning message | |
In file(file, "rt") | |
cannot open file 'doesNotExist': No such file or directory | |
Enter a frame number, or 0 to exit | |
1: read.csv("doesNotExist") | |
2: read.table(file = file, header = header, sep = sep, quote = quote, dec = de | |
3: file(file, "rt") | |
✔ Week 3 @done (14-01-28 16:06) | |
✔ Week 3 Introduction (1:58) @done (14-01-23 14:11) | |
✔ Simulation (14:51) @done (14-01-23 15:47) | |
generating numbers: | |
function prefixes: | |
d: evaluation of density | |
r: generation of random numbers | |
p: evaluation of cumulative distribution | |
q: evaluation of the quantile
main functions: | |
rnorm() generates random number for Normal distribution (input: number of points, mean (def=0) and standard deviation (def=1)) | |
dnorm() evaluates Normal probability density at point (or some points) | |
pnorm() evaluates the cumulative distribution for a Normal distribution | |
qnorm() evaluates the quantile for a Normal distribution qnorm(p) = pnorm^-1(p) | |
N.B.: lower tail: left part vs upper tail: right part | |
rpois() generates random numbers for Poisson distribution (input: number of points and rate, output: integers) | |
ppois() evaluates the cumulative distribution for a Poisson distribution (input: value-inferior-or-equal-to and rate, output: probability)
rbinom() generates random numbers for a Binomial distribution (input: number of points, number of trials and success probability)
Set the seed: | |
set.seed(x) | |
Always do it (before first call) in order to ensure reproducibility.
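A minimal sketch tying the four prefixes together and checking that a seed makes draws reproducible. The seed and input values are arbitrary:

```r
# same seed => same random numbers
set.seed(10); a <- rnorm(3)
set.seed(10); b <- rnorm(3)
identical(a, b)        # TRUE

# d: density, p: cumulative distribution, q: quantile (inverse of p)
dnorm(0)               # 1 / sqrt(2 * pi), about 0.3989
pnorm(0)               # 0.5
qnorm(0.5)             # 0
qnorm(pnorm(1.3))      # 1.3

# Poisson: P(X <= 2) when the rate is 2
ppois(2, 2)            # 5 * exp(-2), about 0.6767
```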
generating numbers from linear model: | |
Example: | |
# Suppose y = a + b * x + e | |
# a: 0.5 | |
# b: 2 | |
# e ~ N(0, 2^2) | |
# x ~ N(0, 1^2) | |
set.seed(20) | |
x <- rnorm(100) | |
e <- rnorm(100, 0, 2) | |
y <- 0.5 + 2 * x + e | |
summary(y) | |
Min. 1st Qu. Median Mean 3rd Qu. Max. | |
-6.4080 -1.5400 0.6789 0.6893 2.9300 6.5050 | |
# graph plots, we can see the line | |
plot(x, y) | |
generating numbers from generalized linear model: | |
generalized linear model means distribution is not Normal (for instance: Poisson). | |
Example: | |
# Y ~ Poisson(mu)
# log mu = a + bx | |
# a = 0.5 | |
# b = 0.3 | |
set.seed(1) | |
x <- rnorm(100) | |
log.mu <- 0.5 + 0.3 * x | |
y <- rpois(100, exp(log.mu)) | |
summary(y) | |
Min. 1st Qu. Median Mean 3rd Qu. Max. | |
0.00 1.00 1.00 1.55 2.00 6.00 | |
plot(x, y) | |
generating samples: | |
sample() draws random elements from a vector
no repetition by default (permutation) | |
repetition can be set via arg replace=TRUE | |
example: | |
> set.seed(3) | |
> sample(1:10, 10) | |
[1] 2 8 4 3 9 6 1 5 10 7 | |
> sample(1:10, 10, replace=TRUE) | |
[1] 6 6 6 6 9 9 2 8 9 3 | |
✔ Plotting with Base Graphics (23:22) @done (14-01-23 17:50) | |
plotting & graphic packages: | |
graphics: base functions such as plot(), hist(), boxplot(), etc... | |
lattice: trellis functions such as xyplot(), bwplot(), etc (independent of graphics)
grid: low-level graphic routine (seldom used) | |
grDevices: graphic devices (screen output / file generation) such as X11, PDF, PNG, etc... | |
process: | |
output: screen? file? | |
usage: for temp screen display, a presentation, a paper? | |
data volume: a few points vs huge data set? | |
resizable?: (bitmap like JPEG or PNG vs vectorial format like PDF or PostScript) | |
which package?: base (simpler, built piecemeal) or grid/lattice ("single function call"); base and lattice plots cannot be mixed
graphics package: | |
example: | |
> x <- rnorm(100) | |
# draw histogram | |
> hist(x) | |
# close window | |
> dev.off() | |
null device | |
1 | |
# draw histogram (opens new window) | |
> hist(x) | |
# draw another histogram (reuses window) | |
> hist(2 * x) | |
graphic parameters: | |
see par() function for setting graphic parameters for current session only | |
some params can be overridden in specific plotting functions
important params: | |
pch: plotting symbol (default: open circle) | |
lty: line type (default: solid line) | |
lwd: line width | |
col: plotting color (default: black) | |
las: orientation of axis labels
bg: background color | |
mar: margin sizes (from bottom to right, clockwise) | |
oma: outer margin sizes | |
mfrow: number of plots per row, column (plots filled row-wise) | |
mfcol: idem (plots filled column-wise) | |
functions: | |
plot: draws a plot (scatterplot usually, other types of plots depending on input object) | |
hist: draws a histogram | |
lines: adds lines to a plot | |
points: adds points to a plot | |
text: adds text | |
title: adds titles (axis labels, title, sub-title, outer margin)
mtext: adds text to margins | |
axis: add axis tick marks or labels | |
legend: adds legend | |
devices: | |
?Devices shows the list of devices
vectorial (resizable): | |
pdf (resizes well, portable) | |
postscript (older, less used) | |
bitmap (+: well-suited for plots with many points, -: does not resize well):
png: good for solid colors, lossless compression
jpeg: good for pictures, lossy compression | |
bitmap: less used | |
bmp: native Windows format, less used | |
example: | |
# generate plot in PDF file: | |
> pdf(file = "testRplot.pdf") | |
> x <- rnorm(100) | |
> hist(x) | |
> dev.off() | |
copying device: | |
to export to file, either: 1. open device, make plot, close device | |
2. make plot on default device (screen), then copy it to other device | |
copying device is not an exact operation! | |
functions: dev.copy(), dev.copy2pdf, dev.list(), dev.set(), dev.off | |
✔ Base Graphics Plotting Demo (16:56) @done (14-01-23 18:18) | |
# launch plot demo, useful to show params (plotting symbols, etc): | |
example(points) | |
# trick: use different plotting symbols or colors on the same plot:
plot(x, y, type = "n") # will draw axes, labels, etc... but not the points
points(x[group == 'Male'], y[group == 'Male'], col = "blue") # draw "Male" blue points via factor "group"
points(x[group == 'Female'], y[group == 'Female'], col = "pink") # draw "Female" pink points via factor "group"
✔ Plotting with Lattice Graphics (7:18) @done (14-01-23 18:57) | |
Main Lattice functions: | |
xyplot() for scatterplots | |
bwplot() for boxplots | |
histogram() for histograms | |
stripplot() for boxplots with points
dotplot()
splom() for scatterplot matrices (directly on a dataset)
levelplot()/contourplot() for image data | |
example: | |
library(lattice) | |
library(nlme) | |
xyplot(distance ~ age | Subject, data = Orthodont) | |
functions do not directly print: | |
returned object of class "trellis" (that can be stored but it's better to save the code)
print() method call needed to draw on device (auto-printed in the console) | |
arguments: | |
first arg: formula like "y ~ x | f * g" (x, y: inputs ; f, g: optional factors) | |
"data" arg: data | |
"panel" (optional): for extra items per group (ex: draw average line via panel.abline() or regression line via panel.lmline()) | |
✔ Lattice Graphics Plotting Demo (21:23) @done (14-01-23 19:35) | |
# display documentation | |
package ? lattice | |
# | |
library(help = lattice)
# split scalar in 4 ranges that overlap slightly | |
# useful to see influence between two variables and a third one | |
temp.cut <- equal.count(environmental$temperature, 4) | |
wind.cut <- equal.count(environmental$wind, 4) | |
xyplot( | |
# 2 4-level factors combined (wind and temp) => 16 panels | |
ozone ~ radiation | temp.cut * wind.cut, | |
data = environmental, as.table = TRUE,
# add regression line | |
panel = function (x, y, ...) { | |
# draw plots | |
panel.xyplot(x, y, ...) | |
# add regression line | |
fit <- lm(y ~ x) | |
panel.abline(fit) | |
# add loess smoother
panel.loess(x, y) | |
}) | |
✔ Plotting with ggplot2: Part 1 (24:18) @done (14-01-27 09:34) | |
ggplot2 = CRAN package that implements "Grammar of Graphics" (book) | |
= "3rd graphics system" (after base and lattice) | |
concept: verb, noun & adjective
workflow: | |
1. start with base function (ie plot) | |
2. use annotation functions (text, lines, points, axis) to add/modify | |
many automatic stuff (but customisation is possible) | |
basic function, qplot(): | |
similar with base's plot() | |
source: data frame (always) from input (preferred) or workspace
output: aesthetics (size, shape, color) + geoms (points, lines)
input factors should be labeled | |
qplot(): simple function (hides complexity) | |
ggplot(): advanced function (more powerful and flexible) | |
examples: | |
install.packages("ggplot2")
library(ggplot2) | |
# draw plots and show sub-groups (based on 'drv' factor variable): | |
qplot(displ # xcoord | |
,hwy # y coord | |
,data = mpg # data frame | |
,color = drv # change aesthetics: color points via 'drv' factor | |
# legend is automatically added | |
) | |
# draw plots and smooth line ("loess"): | |
qplot(displ # xcoord | |
,hwy # y coord | |
,data = mpg # data frame | |
,geom = c("point", "smooth") # add geom: smooth line + 95% confidence interval
) | |
# draw histogram: | |
qplot(hwy # only 1 variable | |
, data = mpg | |
, fill = drv) | |
# draw facets (i.e. groups, like panels for lattice): | |
qplot(displ, hwy, data = mpg | |
, facets = . ~ drv # pattern is "row var" ~ "col var", "." means empty | |
# here, drv has 3 levels => 3 facets | |
) | |
✔ Plotting with ggplot2: Part 2 (28:35) @done (14-01-27 17:26) | |
advanced function, ggplot: | |
example: | |
# prepare plots on 2 variables:
> g <- ggplot(mpg, aes(displ, hwy)) | |
# check: | |
summary(g) | |
data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl, | |
class [234x11] | |
mapping: x = displ, y = hwy | |
faceting: facet_null() | |
# auto save and print: | |
p <- g + geom_point() # geom_point is a layer to draw points | |
> print(p) | |
# print only: | |
g + geom_point() | |
# print (plots) with additional smooth line: | |
g + geom_point() + geom_smooth() | |
# print (plots) with additional smooth line w/ linear model (removes noise): | |
g + geom_point() + geom_smooth( method = "lm" ) | |
# print (plots) with facets: | |
g + geom_point() + facet_grid(. ~ cyl) + geom_smooth( method = "lm" ) | |
annotation: | |
labels: xlab(), ylab(), labs(), ggtitle() | |
addit. drawings: geom_*() like geom_smooth, etc... | |
global vars: theme() (example: theme(legend.position = "none")) | |
appearance: theme_gray() for gray background (default), theme_bw() is a black and white UI theme | |
modify aesthetics: | |
# set point size + constant point color + alpha transparency: | |
g + geom_point( color = "steelblue", size = 4, alpha = 1/2) | |
# set point size + dynamic point color: | |
g + geom_point( aes(color = drv) , size = 4 ) | |
# print (plots) with custom labels: | |
g + geom_point() + labs(title="my title") + labs(x = "custom x axis") + labs( y = expression ("custom y axis for " * hwy) ) | |
# axis limits via ylim(): points outside the range are dropped (filtered out):
g + geom_point() + ylim(-3, 3)
# axis limits via coord_cartesian(): zooms without dropping points (outliers stay in the data, they are just off-screen):
g + geom_point() + coord_cartesian(ylim = c(-3, 3))
Make categories from continuous values:
If the conditioning variable is continuous (many values, not just 3 or 4), cut it into ranges
example: | |
quantile(mpg$cty, seq(0, 1, length=4), na.rm = TRUE) | |
0% 33.33333% 66.66667% 100% | |
9 15 18 35 | |
cutpoints <- quantile(mpg$cty, seq(0, 1, length=4), na.rm = TRUE) | |
mpg$cty_range <- cut(mpg$cty, cutpoints) | |
levels(mpg$cty_range) | |
[1] "(9,15]" "(15,18]" "(18,35]" | |
✔ Plotting with Mathematical Annotation (6:03) @done (14-01-28 16:06) | |
Math symbols can be set in labels, plots, etc... (LaTeX-like symbols)
Use expression() function to write math symbols | |
cf ?plotmath | |
examples: | |
# create base plot with title 'theta = 1': | |
plot(1, 2, main = expression(theta == 1)) | |
# create base plot with title = sum of x[i] * y[i] for i = 1..n (capital sigma symbol is used):
plot(1, 2, main = expression(sum(x[i] * y[i], i==1, n)))
# expression with dynamic value (i.e. from variable):
v <- -2
# Will set title to "x-bar = -2" (bar(x) draws x with an overbar)
plot(1, 2, main = substitute( | |
bar(x) == val, # the expression | |
list(val=v) )) # list of variables to be substituted | |
✔ Week 4 @done (14-01-29 10:49) | |
✔ Plotting and Color in R (22:06) @done (14-01-28 17:08) | |
default colors (1: black, 2: red, 3: green, etc...) are not very pretty!
grDevices package has 2 functions:
colorRamp(colors) takes colors as input and returns a function that takes numbers from 0 to 1 and returns RGB values (matrix with one row per input)
colorRampPalette() takes colors as input and returns a function that takes a number n and returns n colors (as a character vector of hexadecimal RGB codes)
They blend colors (interpolation) | |
colorRamp() examples: | |
pal <- colorRamp(c("red", "blue")) | |
# RGB color for red:
pal(0) | |
[,1] [,2] [,3] | |
[1,] 255 0 0 | |
# RGB color for blue:
pal(1) | |
[,1] [,2] [,3] | |
[1,] 0 0 255 | |
# RGB color in between:
pal(0.5) | |
[,1] [,2] [,3] | |
[1,] 127.5 0 127.5 | |
colorRampPalette examples: | |
pal <- colorRampPalette(c("red", "blue")) | |
# return 2 colors (red and blue): | |
> pal(2) | |
[1] "#FF0000" "#0000FF" | |
# return 10 colors from red to blue: | |
> pal(10) | |
[1] "#FF0000" "#E2001C" "#C60038" "#AA0055" "#8D0071" "#71008D" "#5500AA" | |
[8] "#3800C6" "#1C00E2" "#0000FF" | |
colors() lists color names (instead of RGB values) | |
gray() plays a similar role for gray levels (input from 0 = black to 1 = white)
usage: | |
pal <- colorRampPalette(c("red", "yellow", "blue")) | |
x <- rnorm(100) | |
# plot 100 points, first will be red, 50th will be yellow, 100th will be blue, and other will be in between: | |
plot(x, col = pal(100)) | |
RColorBrewer package:
3 types of palettes
sequential, for continuous ordered data (ex: "Blues" that goes from light blue to dark blue)
diverging (example: positive vs negative values) => from dark color 1 to light (in the middle) to dark color 2 (ex: "Spectral")
qualitative, for data that are not ordered (categorical data) => each color is very different from the previous one (ex: "Set1")
Main function: brewer.pal(numberOfColors, paletteName)
Can be used with colorRamp() and colorRampPalette()
Example: | |
# draw 100 points using colors from light blue to dark blue via "Blues" palette:
library(RColorBrewer) | |
# use 3 primary colors (we don't need more) from "Blues" theme | |
cols <- brewer.pal(3, "Blues") | |
pal <- colorRampPalette(cols) | |
# generate 100 colors | |
plot(x, col = pal(100)) | |
NB: smoothScatter(), for plotting huge number of points, uses RColorBrewer package | |
Additional notes: | |
rgb() returns RGB colors and handles transparency ('alpha' param) | |
Can be useful for overlapping plots: | |
x <- rnorm(10000) | |
plot(x, col = rgb(0, 0, 0, 0.1)) | |
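A small sketch of what the 'alpha' param does to the color code: rgb() appends two extra hex digits for transparency, and col2rgb() can read them back. Variable names are illustrative.

```r
# Opaque red vs. half-transparent red (components given on a 0-1 scale)
opaque <- rgb(1, 0, 0)         # "#FF0000"
half   <- rgb(1, 0, 0, 0.5)    # "#FF000080": the last two hex digits are the alpha channel

# Decode the color back into its red, green, blue and alpha components (0-255)
components <- col2rgb(half, alpha = TRUE)
print(opaque)
print(half)
print(components)
```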
✔ Dates and Times (10:29) @done (14-01-28 17:53) | |
date: | |
a day in year (no time) | |
via Date class (number of days since 1970/01/01) | |
time: | |
date + time + timezone | |
POSIXct and POSIXlt classes (number of seconds since 1970/01/01) | |
POSIXct stores the time as a single large number | |
POSIXlt is a list with additional info such as day of the week, day of the year etc... (year, month, yday, hour, min, sec, etc...) | |
# date from string: | |
d <- as.Date("2013-12-25") | |
# number of days since 1970: | |
unclass(d) | |
[1] 16064 | |
# generic functions for dates and times (output below comes from a French locale: "mercredi" = Wednesday, "décembre" = December): | |
weekdays(d) | |
[1] "mercredi" | |
months(d) | |
[1] "décembre" | |
quarters(d) | |
[1] "Q4" | |
# system time: | |
t <- Sys.time() | |
dt <- as.POSIXlt(t) | |
d$sec | |
Error in d$sec : $ operator is invalid for atomic vectors | |
dt$sec | |
[1] 51.66967 | |
date/time format: | |
strptime() converts one or several strings to time objects (see help for format) | |
date/time operations: | |
via standard functions: +, -, ==, >, etc... | |
+ conversion functions: as.Date, as.POSIXct, as.POSIXlt | |
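A minimal sketch of these operations, using numeric formats only so that the result does not depend on the locale. The dates and variable names are made up for illustration.

```r
# Date arithmetic and comparison
d1 <- as.Date("2012-02-28")
d2 <- as.Date("2012-03-01")
gap <- d2 - d1                 # difftime in days (2012 is a leap year)
next_day <- d1 + 1
later <- d2 > d1

# strptime() returns a POSIXlt object, whose fields can be read with $
t1 <- strptime("2012-10-25 01:00:00", "%Y-%m-%d %H:%M:%S", tz = "UTC")
t2 <- as.POSIXct("2012-10-25 06:00:00", tz = "UTC")
hour1 <- t1$hour

# Convert to a common class before comparing or subtracting
elapsed <- difftime(t2, as.POSIXct(t1), units = "hours")

# POSIXct is a number of seconds since 1970-01-01
secs <- as.numeric(as.POSIXct("1970-01-02", tz = "UTC"))

print(gap)
print(next_day)
print(elapsed)
print(secs)
```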
✔ Regular Expressions (27:21) @done (14-01-28 18:39) | |
regexp = combination of literals (text) and meta-characters (ex: starts with, alternative, word boundary) | |
In R, a way to extract data from "unfriendly" sources (web sites, messy text files, etc...). | |
meta-characters: | |
^: line that starts with text (ex: "^foo") | |
$: line that ends with text (ex: "foo$") | |
[]: a set of characters (ex: [Nn][Ii][Cc][Oo] to find 'nico', ignoring case) | |
can be used with ranges (ex: [0-9] or [a-z]) | |
[^]: negative set (ex: "[^?.]$" returns lines that do NOT end with '?' or '.') | |
.: any single character | |
|: alternatives (ex: "dev|coder" will return lines containing either "dev" or "coder") | |
(): to indicate scope ("^foo|bar" returns lines that start with "foo" or contain "bar" | |
vs "^(foo|bar)" that returns lines that start with "foo" or "bar") | |
to store matched text ("grouping") in \1, \2, etc... | |
?: for optional expressions (ex: "George( [Ww]\.)? Bush" matches "George W. Bush" and "George Bush") | |
\: escape meta-character (ex: "\." means period, not "any character" meta-char) | |
+: at least one of... | |
*: any number of... (including none) | |
"greedy" => + and * match the longest possible string | |
greediness can be turned off by appending ? (ex: ".*?" matches the shortest possible string) | |
{}: custom repetition: {m,n} (between m and n times), {m,} (at least m times) or {m} (exactly m times) | |
(ex: "[Bb]ush( +[^ ]+){1,5} +debate" will match lines having 1 to 5 words between "Bush" and "debate") | |
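The meta-characters above can be tried on a small character vector. This is only a sketch: the test strings are made up, and backslashes are doubled because the patterns are written as R strings.

```r
x <- c("foo bar", "bar foo", "George W. Bush", "George Bush", "Is it?", "done.")

starts_foo <- grepl("^foo", x)                    # anchored at the start
ends_foo   <- grepl("foo$", x)                    # anchored at the end
no_punct   <- grepl("[^?.]$", x)                  # does not end with '?' or '.'
george     <- grepl("George( [Ww]\\.)? Bush", x)  # optional middle initial

# Greedy vs. lazy repetition
greedy <- sub("<.*>", "", "<a><b>text")           # removes "<a><b>"
lazy   <- sub("<.*?>", "", "<a><b>text")          # removes only "<a>"

# Grouping: swap two words through the back-references \1 and \2
swapped <- sub("^([a-z]+) ([a-z]+)$", "\\2 \\1", "foo bar")

# Counted repetition: 1 to 5 words between "Bush" and "debate"
# (each repeated group is a word together with its leading spaces)
words <- grepl("[Bb]ush( +[^ ]+){1,5} +debate",
               c("Bush won the debate", "Bush debate"))

print(data.frame(x, starts_foo, ends_foo, no_punct, george))
print(c(greedy, lazy, swapped))
print(words)
```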
✔ Introduction to Baltimore City Homicide Data (4:20) @done (14-01-28 18:44) | |
✔ Regular Expressions in R (30:08) @done (14-01-28 20:26) | |
grep()/grepl() search in a character vector; grep() returns the indices of the matching elements, grepl() a logical vector | |
regexpr()/gregexpr() are similar but return the position in each string where the match begins + the length of the match | |
regexpr is for first match, gregexpr is for all matches | |
used in conjunction with regmatches() | |
sub()/gsub() search and replace (sub for first match, gsub for all matches) | |
regexec() gives indices for sub-expressions (with parentheses) | |
demo: | |
# Count lines with "shooting": | |
length(grep("[Ss]hooting", homicides)) | |
[1] 1005 | |
# Count lines with "cause: shooting": | |
length(grep("[Cc]ause: [Ss]hooting", homicides)) | |
[1] 1003 | |
# Troubleshooting the difference (the first try is wrong: it compares the two counts, not the matching indices): | |
s1 <- length(grep("[Ss]hooting", homicides)) | |
s2 <- length(grep("[Cc]ause: [Ss]hooting", homicides)) | |
setdiff(s1, s2) | |
[1] 1005 | |
s1 <- grep("[Ss]hooting", homicides) | |
s2 <- grep("[Cc]ause: [Ss]hooting", homicides) | |
setdiff(s1, s2) | |
[1] 318 859 | |
setdiff(s2, s1) | |
integer(0) | |
# states that start with "New": | |
grep("^New", state.name) # indices | |
[1] 29 30 31 32 | |
grep("^New", state.name, value=TRUE) # values | |
[1] "New Hampshire" "New Jersey" "New Mexico" "New York" | |
grepl("^New", state.name) # boolean vector | |
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE | |
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE | |
[25] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE | |
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE | |
[49] FALSE FALSE | |
# Extract dates (initial try): | |
> r <- regexpr("<dd>Found(.*?)</dd>", homicides[1:5]) | |
> regmatches(homicides[1:5], r) | |
[1] "<dd>Found on January 1, 2007</dd>" "<dd>Found on January 2, 2007</dd>" | |
[3] "<dd>Found on January 2, 2007</dd>" "<dd>Found on January 3, 2007</dd>" | |
[5] "<dd>Found on January 5, 2007</dd>" | |
# Extract dates (other try): | |
> r <- regexpr("<dd>Found(.*?)</dd>", homicides[1:5]) | |
> m <- regmatches(homicides[1:5], r) | |
> d <- gsub("<dd>Found on |</dd>", "", m) | |
> as.Date(d[1], "%B %d, %Y") | |
[1] NA | |
# NB: NA because %B expects month names in the current locale (French here); should work with US/English settings | |
> as.Date(d, "%B %d, %Y") | |
[1] NA NA NA NA NA | |
# Using sub-expressions via regexec: | |
r <- regexec("<dd>Found on (.*?)</dd>", homicides) | |
m <- regmatches(homicides, r) | |
dates <- sapply(m, function(x) x[2]) | |
dates[1] | |
[1] "January 1, 2007" | |
dates <- as.Date(dates, "%B %d, %Y") | |
# hist() can automatically handle dates | |
hist(dates, "month", freq=TRUE) | |
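The demo above relies on the course's homicides data set. The same steps can be sketched on a few made-up records (the 'records' vector below is hypothetical, only shaped like the real lines). Switching LC_TIME to the "C" locale for the conversion is one possible way to avoid the NA results seen with French settings.

```r
# Hypothetical records, shaped like the lines of the homicides data
records <- c("<dd>Found on January 1, 2007</dd><dd>Cause: shooting</dd>",
             "<dd>Found on March 14, 2008</dd><dd>Cause: Shooting</dd>",
             "<dd>Found on July 4, 2009</dd><dd>Cause: stabbing</dd>")

# Indices of the records that mention a shooting
shooting_idx <- grep("[Ss]hooting", records)

# regexec() + regmatches(): element 1 is the full match, element 2 the first sub-expression
m <- regmatches(records, regexec("<dd>Found on (.*?)</dd>", records))
found <- sapply(m, function(x) x[2])

# Parse English month names whatever the system locale is, then restore the locale
old_locale <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))
parsed <- as.Date(found, "%B %d, %Y")
invisible(Sys.setlocale("LC_TIME", old_locale))

print(shooting_idx)
print(found)
print(parsed)
```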
✔ Classes and Methods in R (34:51) @done (14-01-29 10:49) | |
S (and R!) supports OOP with a few specificities | |
2 systems: | |
- S3 classes and methods: informal, "old-style", easier, quick-and-dirty | |
- S4 classes and methods: more formal and rigorous, "new-style" | |
Separate systems, but can partially mix. | |
S4 style is in "methods" package (usually loaded by default). | |
Class definition: setClass() | |
Object creation: new() | |
Method: function that only operates on a certain class of objects | |
Generic function: function that only dispatches to methods (ex: plot() dispatches to different functions depending on input data type) | |
help: ?Classes, ?Methods, ?setClass, ?setMethod, ?setGeneric | |
Generic functions: | |
S3: | |
# display "mean" function signature: | |
mean | |
function (x, ...) | |
UseMethod("mean") | |
<bytecode: 0x065e11dc> | |
<environment: namespace:base> | |
# dispatched methods: | |
methods("mean") | |
[1] mean.Date mean.default mean.difftime mean.POSIXct mean.POSIXlt | |
# display code for default method: | |
getS3method("mean", "default") | |
function (x, trim = 0, na.rm = FALSE, ...) | |
{ | |
if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) { | |
# ... | |
S4: | |
# display "show" function signature: | |
> show | |
standardGeneric for "show" defined from package "methods" | |
function (object) | |
standardGeneric("show") | |
<bytecode: 0x052674c4> | |
<environment: 0x04f907a0> | |
Methods may be defined for arguments: object | |
Use showMethods("show") for currently available ones. | |
(This generic function excludes non-simple inheritance; see ?setIs) | |
# display "show" dispatch methods: | |
> head(showMethods("show")) | |
Function: show (package methods) | |
object="ANY" | |
object="classGeneratorFunction" | |
object="classRepresentation" | |
A generic function has at least one param (the object). | |
If a method matching the class of the object is found, it is called. Otherwise the default method is called if it exists, else an error is thrown. | |
S3 dispatch methods should not be called directly! | |
S4 dispatch methods cannot be called directly. | |
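S3 dispatch can be sketched with a made-up generic. The names area(), "circle" and "rectangle" are hypothetical and only serve the example: the generic calls UseMethod(), and R picks the function named generic.class, falling back on generic.default.

```r
# Generic function: only dispatches on the class of its first argument
area <- function(shape, ...) UseMethod("area")

# Methods are plain functions named <generic>.<class>
area.circle    <- function(shape, ...) pi * shape$r^2
area.rectangle <- function(shape, ...) shape$w * shape$h

# Default method, used when no class-specific method exists
area.default <- function(shape, ...) {
  stop("no area() method for class ", class(shape)[1])
}

# An S3 object is any R object with a class attribute
circ      <- structure(list(r = 1), class = "circle")
rectangle <- structure(list(w = 2, h = 3), class = "rectangle")

print(area(circ))
print(area(rectangle))

# A plain number has no specific method, so the default one is called and fails
res <- tryCatch(area(42), error = function(e) conditionMessage(e))
print(res)
```

Calling area(circ) rather than area.circle(circ) follows the rule above: go through the generic, not the method.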
Write your own type: | |
Why? To represent custom model (e.g. gene expression) that does not have built-in type | |
Probably need to write methods for print()/show(), summary(), plot() | |
For new S4 type: | |
1. use setClass() to define: | |
- name of class | |
- data elements, "slots" | |
2. use setMethod() to define: | |
- methods | |
3. check class info: | |
via showClass() | |
Example: | |
# define polygon: | |
setClass("polygon", representation(x = "numeric", y = "numeric")) | |
# implement plot() for polygon, and register it (side-effect) for current session: | |
setMethod( | |
"plot", # generic function | |
"polygon", # class name | |
function (x, y, ...) { | |
# draw points via default method | |
plot(x@x, x@y, type="n", ...) | |
# draw lines | |
xp <- c(x@x, x@x[1]) | |
yp <- c(x@y, x@y[1]) | |
lines(xp, yp) | |
}) | |
# check that it has been registered: | |
showMethods("plot") | |
Function: plot (package graphics) | |
x="ANY" | |
x="polygon" | |
# create polygon object and call plot(): | |
p <- new("polygon", x = c(1,2,3,4), y = c(1,2,3,1)) | |
plot(p) |
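The same steps also work without graphics. The sketch below redefines the polygon class so that it runs on its own, adds a new generic with setGeneric() (the name perimeter() is made up for this example) and a show() method for printing.

```r
library(methods)

# 1. class definition: name + slots
setClass("polygon", representation(x = "numeric", y = "numeric"))

# 2. new generic function, then a method for the class
setGeneric("perimeter", function(shape, ...) standardGeneric("perimeter"))
setMethod("perimeter", "polygon", function(shape, ...) {
  # close the polygon by repeating the first vertex, then sum the edge lengths
  xp <- c(shape@x, shape@x[1])
  yp <- c(shape@y, shape@y[1])
  sum(sqrt(diff(xp)^2 + diff(yp)^2))
})

# show() is the S4 counterpart of print()
setMethod("show", "polygon", function(object) {
  cat("<polygon with", length(object@x), "vertices>\n")
})

# 3. create an object (unit square) and use the methods
sq <- new("polygon", x = c(0, 1, 1, 0), y = c(0, 0, 1, 1))
print(perimeter(sq))
sq
```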