r-cheatsheet.md

#R cheatsheet

Created while studying for the Machine Learning course at FIB at the Universitat Politècnica de Catalunya (BarcelonaTech), this cheatsheet contains the knowledge taught during the first half of this course.

##R language

###Syntax

for loops:

for(a in 1:10) {
    print(a);
}

function definitions:

myFunction <- function(a, b, c = 0) {
  tmp <- a + b;
  tmp - c; #the value of the last line is automatically returned
}
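
As a minimal sketch of how the default argument behaves, the function above can be called with or without `c`, and arguments can also be passed by name. The definition is repeated so the snippet runs on its own; the names `r1`, `r2` and `r3` are only illustrative.

```r
myFunction <- function(a, b, c = 0) {
  tmp <- a + b
  tmp - c  # the value of the last expression is returned
}

r1 <- myFunction(1, 2)                 # c falls back to its default 0, result 3
r2 <- myFunction(1, 2, 4)              # positional arguments, result -1
r3 <- myFunction(b = 2, a = 1, c = 4)  # named arguments may come in any order, result -1
print(c(r1, r2, r3))
```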

###array handling

A vector can be created using `c`. A sequence of integers can be created using `a:b`.

test.vector <- c("a", "b", "c", "d");
1:10 #equivalent to c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

An element of this vector can be accessed using square brackets. Elements are counted starting from 1. Negative numbers can be used to exclude elements. Vectors can be used as indices.

test.vector[1] #returns "a"
test.vector[2] #returns "b"
test.vector[-2] #returns "a" "c" "d"
test.vector[c(1,3)] #returns "a" "c"
test.vector[-c(1,3)] #returns "b" "d"
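
Besides positions, a logical vector can also serve as an index. The following is a small sketch of that idea; the variable names `keep`, `selected` and `positions` are assumptions made for illustration.

```r
test.vector <- c("a", "b", "c", "d")
keep <- test.vector != "b"     # logical vector: TRUE FALSE TRUE TRUE
selected <- test.vector[keep]  # keeps the elements where keep is TRUE: "a" "c" "d"
positions <- which(keep)       # the matching positions: 1 3 4
print(selected)
print(positions)
```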

An empty matrix can be created using matrix(data, nrow, ncol). Fields are accessed using square brackets, either by index or by column name. The $ operator only works on data frames, not on matrices.

test.matrix <- matrix(NA, 5, 4);
colnames(test.matrix) <- c("col1", "col2", "col3", "result");
test.matrix[c(1,3), c(2,4)] #the submatrix composed of rows 1,3 and columns 2,4
test.matrix[,c("col1", "col2")] #the submatrix composed of all rows and columns 1 and 2
test.matrix[,"result"] #returns only the result column

##Basic functions

###Data related functions

function	description
a:b	equivalent to seq(a, b)
seq(a,b,by)	creates a sequence of numbers between `a` and `b` using step size `by`
rep(x, times)	replicates `x` `times` times
matrix(data, nrow, ncol)	creates a new matrix
data.frame(...)	creates a data frame using all the parameters as columns
diag(matrix)	gets/sets the diagonal of this matrix
diag(vector)	returns a matrix with the given diagonal
diag(scalar)	returns an identity matrix with the given size
diag(scalar, nrow)	returns an `nrow` x `nrow` matrix with `scalar` on the diagonal
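
A short sketch of some of the functions above in use. The variable names are illustrative only.

```r
s <- seq(0, 1, by = 0.25)  # 0.00 0.25 0.50 0.75 1.00
r <- rep("x", 3)           # "x" "x" "x"
id3 <- diag(3)             # 3x3 identity matrix
d <- diag(c(1, 2, 3))      # 3x3 matrix with 1, 2, 3 on the diagonal
diag(d) <- 0               # setter form: overwrites the diagonal
print(s)
print(id3)
```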

###Inspection

function	description
dim(x)	returns the size in all dimensions
nrow(x)	returns the number of rows
ncol(x)	returns the number of columns
names(x)	gets/sets the names of an object
colnames(x)	gets/sets the column names of a matrix or data frame
rownames(x)	gets/sets the row names of a matrix or data frame
summary(x)	prints a summary
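
To illustrate the inspection functions, here is a minimal sketch on a small matrix and the data frame derived from it. The objects `m` and `df` are made up for the example.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
colnames(m) <- c("col1", "col2", "col3")
print(dim(m))           # 2 3
print(nrow(m))          # 2
print(ncol(m))          # 3
df <- as.data.frame(m)
print(names(df))        # for a data frame, names() gives the column names
summary(df)
```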

###Type handling

function	description
typeof(x)	returns the internal type of a variable
as.integer(x)	casts to integer
as.double(x)	casts to double
as.factor(x)	casts to factor
as.ordered(x)	casts to ordered factor
as.character(x)	casts to string
as.data.frame(x)	casts to a data frame
is.*(x)	tests if x is of the specified type
levels(factor)	gets/sets the available levels of a factor
droplevels(factor)	drops unused levels of a factor
cut(x, breaks, labels = NULL)	creates a factor by cutting `x` into slices specified by `breaks`
unclass(x)	casts a factor into its integer codes; the levels are kept as an attribute
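
A common pitfall with these casts is that converting a factor directly to a number yields the internal level codes. The sketch below shows this together with `droplevels` and `cut`; all data values and names are invented for illustration.

```r
f <- as.factor(c("10", "20", "10", "30"))
codes <- as.integer(f)                 # level codes 1 2 1 3, not the numbers
values <- as.double(as.character(f))   # 10 20 10 30
f2 <- droplevels(f[f != "30"])         # "30" is no longer among the levels

ages <- c(5, 17, 30, 70)
# intervals (0,18], (18,65], (65,Inf] labelled child, adult, senior
groups <- cut(ages, breaks = c(0, 18, 65, Inf),
              labels = c("child", "adult", "senior"))
print(groups)
```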

###File handling

function	description
setwd(path)	sets the working directory
getwd()	gets the working directory
list.files(path = ".")	lists contents of a directory
read.csv(filename, ...)	reads a CSV file
save(data, file=...)	writes `data` into `file` (binary)
load(file)	loads a variable saved using `save`

###String operations

function	description
paste	concatenates strings, separated by `sep` (default: a space)
paste0	concatenates strings without separator
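
The difference between the two is easiest to see side by side. A small sketch, with variable names chosen only for this example:

```r
a <- paste("model", 1:3)                       # "model 1" "model 2" "model 3"
b <- paste0("model", 1:3)                      # "model1" "model2" "model3"
s <- paste("a", "b", sep = "-")                # "a-b"
j <- paste(c("a", "b", "c"), collapse = ", ")  # "a, b, c" (joins a vector into one string)
print(a)
print(b)
```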

###Calculations

function	description
apply(x, margin, fun)	applies function `fun` to every row (`margin = 1`) or column (`margin = 2`) of `x`
sum(x)	calculates the sum of a vector/matrix
mean(x)	calculates the mean of a vector/matrix
median(x)	calculates medians
quantile(x, probs = seq(0,1,0.25))	returns the quantiles of a variable
range(x)	returns the minimum and maximum of `x`
cor(x)	calculates the correlation matrix of the columns of a matrix or data frame
choose(n, k)	calculates binomial coefficients
ginv(x)	calculates the Moore-Penrose generalized inverse. library: `MASS`
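
As a sketch of how `apply` treats rows and columns, together with `quantile` and the binomial coefficient, consider the following. The matrix `x` is an arbitrary example.

```r
x <- matrix(1:6, nrow = 2)      # filled column-wise: rows are (1, 3, 5) and (2, 4, 6)
row.sums <- apply(x, 1, sum)    # margin 1 = rows: 9 12
col.means <- apply(x, 2, mean)  # margin 2 = columns: 1.5 3.5 5.5
q <- quantile(1:9)              # 0%, 25%, 50%, 75%, 100%: 1 3 5 7 9
n.choose.k <- choose(5, 2)      # 10
print(row.sums)
print(col.means)
```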

###Other useful functions

function	description
which(x)	returns the indices of the TRUE values in a boolean vector
replicate(n, expr)	evaluates `expr` `n` times and returns the results combined into a vector
rm(x)	deletes a variable
ls()	enumerates all defined variables

Remove all variables: rm(list = ls())
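
A brief sketch of `which` and `replicate`; the data and the names `big` and `means` are invented for illustration.

```r
set.seed(1)                               # make the random draws reproducible
x <- c(3, 8, 1, 9)
big <- which(x > 5)                       # positions of the TRUE values: 2 4
means <- replicate(100, mean(runif(10)))  # 100 means of 10 uniform draws each
print(big)
print(length(means))
```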

##Handling missing data

function	description
na.omit(x)	removes rows with NAs from a data frame
addNA(x)	for factors: adds NA as a new level
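
The following sketch shows both functions on a tiny invented data frame. It also uses `is.na` and the `na.rm` argument of `mean`, which are not listed in the table above but are part of base R.

```r
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
complete <- na.omit(df)              # only the first row has no NA
n.missing <- sum(is.na(df$x))        # 1
f <- addNA(factor(c("a", NA, "b")))  # NA becomes a level of its own
m <- mean(df$x, na.rm = TRUE)        # 2; without na.rm the result would be NA
print(complete)
print(levels(f))
```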

##Statistics

###Random Numbers

function	description
set.seed(x)	sets the random seed
runif(n, min = 0, max = 1)	samples the uniform distribution
rbinom(n, size, prob)	samples the binomial distribution
rnorm(N, mean, sd)	samples the normal distribution
rmvnorm(N, mu, sigma)	samples a multivariate normal distribution. library: `mvtnorm`
rpois(n, lambda)	samples the Poisson distribution
sample(vector, N)	draws `N` samples from `vector` (without replacement unless `replace=TRUE`)
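
Setting the seed makes random draws reproducible, which the sketch below demonstrates. The seed value and variable names are arbitrary choices for the example.

```r
set.seed(42)
first <- rnorm(5, mean = 0, sd = 1)
set.seed(42)
second <- rnorm(5, mean = 0, sd = 1)      # identical to first, since the seed was reset

perm <- sample(1:10, 10)                  # a random permutation of 1..10
boot <- sample(1:10, 10, replace = TRUE)  # drawing with replacement, values may repeat
print(identical(first, second))
```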

###Tests

function	description
chisq.test(x,y)	performs a Chi-square test
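
A minimal sketch of a Chi-square test of independence on two simulated factors. The variables `smoker` and `sport` are made-up data, drawn independently of each other.

```r
set.seed(1)
smoker <- factor(sample(c("yes", "no"), 200, replace = TRUE))
sport <- factor(sample(c("low", "high"), 200, replace = TRUE))
res <- chisq.test(smoker, sport)  # builds the cross table internally
print(res$statistic)              # the test statistic
print(res$p.value)                # a small p-value would speak against independence
```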

##Plotting

function	description
plot(x)	plots x (in a hopefully useful way)
hist(x)	prints a histogram
boxplot(x)	creates a box plot
barplot(x)	creates a bar plot
pie(x)	creates a pie chart
pairs(x)	draws pairwise scatter-plots of all variables
abline(h=...)	creates a horizontal line
abline(v=..., lty="dashed")	creates a dashed vertical line
curve(f)	draws a function
title(new.title)	sets the title of a plot
text(x, y=NULL, labels)	adds text to a plot
legend	adds a legend
graphics.off()	reset/close all graphic devices
par(...)	sets graphical parameters

graphical parameters:

lty: line type
col: color, e.g. "blue", "red", ..., "#ffcc00" (hex RGB)
bg: background color
cex: font size
cex.axis: font size on axes
xlog, ylog: use logarithmic axes
xlab, ylab: label of x-/y-axis
las: orientation of axis labels: parallel (0), horizontal (1), perpendicular (2), vertical (3)
mfrow=c(x, y): subdivides area for multiple plots into x rows and y columns
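
A sketch of how some of these parameters combine in practice. The plotted data is arbitrary, and saving the old settings in `old.par` is just one possible way to restore them afterwards.

```r
old.par <- par(mfrow = c(1, 2), cex = 0.8)  # 1 row, 2 columns of plots
x <- seq(-3, 3, by = 0.1)
plot(x, x^2, col = "blue", xlab = "x", ylab = "x squared", las = 1)
abline(v = 0, lty = "dashed")               # dashed vertical line at x = 0
hist(rnorm(100), main = "", col = "#ffcc00")
title("100 normal samples")
par(old.par)                                # restore the previous settings
```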

Plot histogram together with normal estimation:

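hist.with.normal <- function (x, xlabel=deparse(substitute(x)), ...)
{
  h <- hist(x, plot=FALSE, ...) #compute the bins without drawing
  s <- sd(x)
  m <- mean(x)
  ylim <- range(0, h$density, dnorm(0, sd=s)) #make room for the peak of the normal density
  hist(x, freq=FALSE, ylim=ylim, xlab=xlabel, main="", ...)
  curve(dnorm(x, m, s), add=TRUE) #overlay the fitted normal density
}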

##Clustering

library: cclust

function	description
cclust(coordinates,K,iter.max=100,method="kmeans",dist="euclidean")	clusters `coordinates` into `K` clusters (here using k-means)
clustIndex(clustering, coordinates, index="calinski")	calculates Calinski index for a clustering obtained using `cclust`

##Model fitting

function	description
table	creates a cross table
prop.table	converts a cross table into relative frequencies
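
A small sketch of both functions on invented data. The second argument of `prop.table` (1 for rows, 2 for columns) normalizes within each row or column instead of over the whole table.

```r
sex <- c("m", "f", "f", "m", "f")
passed <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
ct <- table(sex, passed)     # counts for every combination of sex and passed
pt <- prop.table(ct)         # all cells together sum to 1
row.pt <- prop.table(ct, 1)  # every row sums to 1
print(ct)
print(row.pt)
```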

###LDA, QDA

function	description
lsfit(x,y)	linear least squares fit `y=a*x + b`
lda(x, grouping, prior=..., CV=...)	performs LDA; `x`: input data, `grouping`: group assignments, `CV`: if true, LOOCV is performed and the predictions are returned instead of the model. library: `MASS`
qda(x, grouping, prior=..., CV=...)	the same parameters as lda. library: `MASS`
partimat(x, grouping, method)	applies LDA/QDA in pairs of dimensions. Plots the decision regions. library: `klaR`

draw the linear regression line:

abline(lsfit(x,y))
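
For context, a self-contained sketch of the same idea on simulated data. The true slope 2 and intercept 1 are arbitrary values chosen for the example.

```r
set.seed(1)
x <- 1:20
y <- 2 * x + 1 + rnorm(20, sd = 0.5)  # noisy line with slope 2 and intercept 1
fit <- lsfit(x, y)
print(fit$coefficients)               # estimated intercept and slope
plot(x, y)
abline(fit)                           # adds the fitted line to the scatter plot
```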

LDA example:

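library(MASS) #provides lda and the crabs data set
data(crabs)
#assumption: Crabs holds the five numeric measurements, Crabs.class the species/sex group
Crabs <- crabs[, c("FL", "RW", "CL", "CW", "BD")]
Crabs.class <- factor(paste(crabs$sp, crabs$sex, sep=""))
lda.model <- lda (x=Crabs, grouping=Crabs.class)
lda.model #we can directly inspect the model
plot(lda.model) #we can plot it
#project the data into the new space
loadings <- as.matrix(Crabs) %*% as.matrix(lda.model$scaling)
ct <- table(Crabs.class, predict(lda.model, Crabs)$class)
sum(diag(prop.table(ct))) #total fraction of correct predictions
prediction <- predict(lda.model, newdata=...)
prediction$class #the predicted classes
prediction$posterior #the posteriors
prediction$x #the discriminants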

###Knn

function	description
knn(train, test, cl, k=1)	classifies the rows of `test` by their `k` nearest neighbours in `train`; `cl`: classes of the training rows. Read the note below! library: `class`
knn.cv(train, cl, k=1)	performs a LOOCV for Knn

knn returns the predicted values as a factor. Casting it directly to numbers yields the internal level codes instead of the values.
Instead, the results must be cast to strings first:

results <- as.double(as.character(knn(train, test, cl)))

###Naive Bayes

library(e1071)
model <- naiveBayes(Class ~ ., data=..., laplace=...)
predict(model, newdata)
predict(model, newdata, type = "raw")

###(Generalized) Linear Models

function	description
lm(formula)	fits a linear model
lm.ridge(formula, lambda=...)	fit a linear model using ridge regression. library: `MASS`
glm(formula, family=...)	fits a generalized linear model
glm(formula, family=binomial(link=logit))	performs a logistic regression
glm(formula, family=poisson(link=log))	performs a Poisson regression
glm(formula, family=gaussian)	performs an ordinary linear regression (equivalent to `lm`)

An additional `data` argument is necessary if the variables in the formula are not already defined in the current environment.
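
A minimal sketch of passing the variables through `data` rather than defining them globally. The data frame `d` is simulated, with a true slope of 3 and intercept of -2 chosen arbitrarily.

```r
set.seed(1)
d <- data.frame(x = 1:30)
d$y <- 3 * d$x - 2 + rnorm(30)  # x and y exist only as columns of d
model <- lm(y ~ x, data = d)    # the formula is resolved inside d
print(coef(model))
pred <- predict(model, newdata = data.frame(x = c(31, 32)))
print(pred)
```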

Ridge regression (`select` prints suggested values for `lambda`):

select(lm.ridge(...))

GLM Example:

glm.res <- glm (resp~x+y+z, family = binomial(link=logit)) #resp: placeholder name for the binary response
summary(glm.res) #shows more information than just "glm.res"
glm.res$coefficients #access the coefficients
exp(glm.res$coefficients["x"]) #by which factor do the odds change for x=x+1?
exp(glm.res$coefficients) #same for all coefficients
ord <- predict(glm.res, data.frame(x=newx, y=newy, z=newz),type="response") #returns the predicted probabilities; type="link" returns the log-odds
step(glm.res) #tries to simplify the model by stepwise removing variables, guided by AIC
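
To make the difference between the two prediction types concrete, here is a self-contained sketch on simulated data. The coefficients 0.5 and 1.5 used to generate the data are arbitrary assumptions.

```r
set.seed(1)
d <- data.frame(x = rnorm(200))
# simulate a binary response whose log-odds are 0.5 + 1.5 * x
d$y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.5 * d$x))
model <- glm(y ~ x, family = binomial(link = logit), data = d)

new <- data.frame(x = c(-1, 0, 1))
log.odds <- predict(model, new, type = "link")   # linear predictor (log-odds)
probs <- predict(model, new, type = "response")  # predicted probabilities
print(probs)
print(plogis(log.odds))                          # same values as probs
```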

vogelsgesang/r-cheatsheet.md
