@vogelsgesang
Last active August 29, 2015 14:02

#R cheatsheet

Created while studying for the Machine Learning course at FIB at Universitat Politècnica de Catalunya (BarcelonaTech), this cheatsheet contains the knowledge taught during the first half of this course.

##R language

###Syntax

for loops:

for(a in 1:10) {
    print(a);
}

function definitions:

myFunction <- function(a, b, c = 0) {
  tmp <- a + b;
  tmp - c; #the value of the last evaluated expression is returned automatically
}
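
As a small illustration of the default argument and the implicit return value, the function above could be called as follows. This is only a sketch; the argument values are made up.

```r
myFunction <- function(a, b, c = 0) {
  tmp <- a + b
  tmp - c  # the value of the last expression is the return value
}

r1 <- myFunction(1, 2)         # c falls back to its default value 0
r2 <- myFunction(1, 2, c = 5)  # arguments can also be passed by name
print(c(r1, r2))
```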

###array handling

A vector can be created using c. A sequence of integers can be created using a:b

test.vector <- c("a", "b", "c", "d");
1:10 #equivalent to c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

An element of this vector can be accessed using square brackets. Elements are counted starting from 1. Negative indices can be used to exclude elements. Vectors can be used as indices.

test.vector[1] #returns "a"
test.vector[2] #returns "b"
test.vector[-2] #returns "a" "c" "d"
test.vector[c(1,3)] #returns "a" "c"
test.vector[-c(1,3)] #returns "b" "d"
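
Besides integer vectors, a logical vector of the same length can also serve as an index. The following sketch reuses test.vector from above; the names keep, selected and positions are only illustrative.

```r
test.vector <- c("a", "b", "c", "d")

keep <- test.vector != "b"     # logical vector with one entry per element
selected <- test.vector[keep]  # keeps the elements where keep is TRUE
positions <- which(!keep)      # indices of the excluded elements

print(selected)
print(positions)
```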

A matrix filled with NA can be created using matrix(data, nrow, ncol). Fields are accessed using square brackets, with either indices or column names. The $ operator works only on data frames and lists, not on matrices.

test.matrix <- matrix(NA, 5, 4); #5 rows, 4 columns (one per column name below)
colnames(test.matrix) <- c("col1", "col2", "col3", "result");
test.matrix[c(1,3), c(2,4)] #the submatrix composed of rows 1,3 and columns 2,4
test.matrix[,c("col1", "col2")] #the submatrix composed of all rows and columns 1 and 2
test.matrix[,"result"] #returns only the result column
as.data.frame(test.matrix)$result #$ requires a data frame
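
A short sketch of column access by name, using a small matrix with made-up values. In R, `$` is defined for data frames and lists but not for plain matrices, so the matrix is converted with as.data.frame before `$` is used.

```r
test.matrix <- matrix(1:8, nrow = 2, ncol = 4)  # filled column by column
colnames(test.matrix) <- c("col1", "col2", "col3", "result")

by.name <- test.matrix[, "result"]     # works on a matrix
test.df <- as.data.frame(test.matrix)  # data frames support $
by.dollar <- test.df$result

print(by.name)
print(by.dollar)
```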

##Basic functions

###Data related functions

| function | description |
| --- | --- |
| `a:b` | equivalent to `seq(a, b)` |
| `seq(a,b,by)` | creates a sequence of numbers between a and b using step size by |
| `rep(x, times)` | repeats x the given number of times |
| `matrix(data, nrow, ncol)` | creates a new matrix |
| `data.frame(...)` | creates a data frame using all the parameters as columns |
| `diag(matrix)` | gets/sets the diagonal of this matrix |
| `diag(vector)` | returns a matrix with the given diagonal |
| `diag(scalar)` | returns an identity matrix with the given size |
| `diag(scalar, nrow)` | returns an nrow x nrow matrix with scalar repeated on the diagonal |
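
The following sketch exercises a few of these functions on made-up values; all variable names are illustrative.

```r
s <- seq(0, 1, by = 0.25)         # 0, 0.25, 0.5, 0.75, 1
r <- rep(c("x", "y"), times = 2)  # the whole vector is repeated twice

I3 <- diag(3)                     # 3x3 identity matrix
D <- diag(c(1, 2, 3))             # matrix with the given diagonal
d <- diag(D)                      # extracts the diagonal again

frame <- data.frame(id = 1:3, value = d)  # each argument becomes a column
print(frame)
```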

###Inspection

| function | description |
| --- | --- |
| `dim(x)` | returns the size in all dimensions |
| `nrow(x)` | returns the number of rows |
| `ncol(x)` | returns the number of columns |
| `names(x)` | gets/sets the names of an object |
| `colnames(x)` | gets/sets the column names of a matrix or data frame |
| `rownames(x)` | gets/sets the row names of a matrix or data frame |
| `summary(x)` | prints a summary |

###Type handling

| function | description |
| --- | --- |
| `typeof(x)` | gives information about the type of a variable |
| `as.integer(x)` | casts to integer |
| `as.double(x)` | casts to double |
| `as.factor(x)` | casts to factor |
| `as.ordered(x)` | casts to ordered factor |
| `as.character(x)` | casts to string |
| `as.data.frame(x)` | casts to a data frame |
| `is.*(x)` | tests if x is of the specified type |
| `levels(factor)` | gets/sets the available levels of a factor |
| `droplevels(factor)` | drops unused levels of a factor |
| `cut(x, breaks, labels = NULL)` | creates a factor by cutting x into slices specified by breaks |
| `unclass(x)` | removes the class attribute; for a factor, returns the integer codes with the levels kept as an attribute |
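
A minimal sketch of the factor-related functions, assuming a made-up vector of ages and arbitrarily chosen breaks:

```r
ages <- c(3, 17, 42, 70)

# by default cut() uses intervals that are open on the left and closed on the right
age.group <- cut(ages, breaks = c(0, 18, 65, Inf),
                 labels = c("child", "adult", "senior"))

codes <- unclass(age.group)  # integer codes, levels kept as an attribute
adults <- droplevels(age.group[age.group == "adult"])  # unused levels removed

print(levels(age.group))
print(levels(adults))
```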

###File handling

| function | description |
| --- | --- |
| `setwd(path)` | sets the working directory |
| `getwd()` | gets the working directory |
| `list.files(path = ".")` | lists contents of a directory |
| `read.csv(filename, ...)` | reads a CSV file |
| `save(data, file=...)` | writes data into file (binary) |
| `load(file)` | loads a variable saved using save |

###String operations

| function | description |
| --- | --- |
| `paste` | concatenates strings |
| `paste0` | concatenates strings without separator |

###Calculations

| function | description |
| --- | --- |
| `apply(x, margin, fun)` | applies function fun to every row (margin 1) or column (margin 2) of x |
| `sum(x)` | calculates the sum of a vector/matrix |
| `mean(x)` | calculates the mean of a vector/matrix |
| `median(x)` | calculates the median |
| `quantile(x, probs = seq(0,1,0.25))` | returns the quantiles of a variable |
| `range(x)` | returns the minimum and maximum of x |
| `cor(x)` | calculates the correlation matrix within a data frame |
| `choose(n, k)` | calculates binomial coefficients |
| `ginv(x)` | calculates the Moore-Penrose generalized inverse. library: MASS |
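
A brief sketch of how the margin argument of apply selects rows or columns, together with quantile and choose. The matrix and the numbers are made up.

```r
m <- matrix(1:6, nrow = 2)      # 2 rows, 3 columns, filled column by column

row.sums <- apply(m, 1, sum)    # margin 1: one result per row
col.means <- apply(m, 2, mean)  # margin 2: one result per column

q <- quantile(c(1, 2, 3, 4, 5)) # quartiles by default
n.pairs <- choose(5, 2)         # number of ways to pick 2 out of 5

print(row.sums)
print(col.means)
```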

###Other useful functions

| function | description |
| --- | --- |
| `which(x)` | returns the indices of the TRUE values in a boolean vector |
| `replicate(n, expr)` | evaluates expr n times and returns the results composed as a vector |
| `rm(x)` | deletes a variable |
| `ls()` | enumerates all defined variables |

Remove all variables: rm(list = ls())

##Handling missing data

| function | description |
| --- | --- |
| `na.omit(x)` | removes rows with NAs from a data frame |
| `addNA(x)` | for factors: adds NA as a new level |
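
A minimal sketch of both functions on a made-up data frame with missing values:

```r
d <- data.frame(x = c(1, NA, 3), g = factor(c("a", "b", NA)))

complete <- na.omit(d)    # keeps only the rows without any NA
g.with.na <- addNA(d$g)   # NA becomes an explicit factor level

print(complete)
print(levels(g.with.na))
```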

##Statistics

###Random Numbers

| function | description |
| --- | --- |
| `set.seed(x)` | sets the random seed |
| `runif(n, min = 0, max = 1)` | samples the uniform distribution |
| `rbinom(n, size, prob)` | samples the binomial distribution |
| `rnorm(N, mean, sd)` | samples the normal distribution |
| `rmvnorm(N, mu, sigma)` | samples a multivariate normal distribution. library: mvtnorm |
| `rpois(n, lambda)` | samples the Poisson distribution |
| `sample(vector, N)` | draws N samples from vector (without replacement unless replace=TRUE) |
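
Setting the seed makes random draws reproducible. The sketch below uses an arbitrary seed value and checks that the same seed yields the same sample.

```r
set.seed(42)
first <- rnorm(5, mean = 0, sd = 1)

set.seed(42)                        # same seed again
second <- rnorm(5, mean = 0, sd = 1)
print(identical(first, second))     # the two draws are the same

u <- runif(3, min = 0, max = 10)    # uniform numbers between 0 and 10
picked <- sample(1:10, 3)           # three distinct values, no replacement
```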

###Tests

| function | description |
| --- | --- |
| `chisq.test(x,y)` | performs a Chi-square test |
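
A sketch of a Chi-square test of independence between two factors. The data are made up so that the two factors are clearly dependent.

```r
x <- factor(rep(c("a", "b"), each = 50))
y <- factor(c(rep(c("u", "v"), times = c(40, 10)),   # group a: mostly u
              rep(c("u", "v"), times = c(10, 40))))  # group b: mostly v

res <- chisq.test(x, y)  # builds the contingency table internally
print(res$statistic)     # the test statistic
print(res$p.value)       # a small p-value speaks against independence
```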

##Plotting

| function | description |
| --- | --- |
| `plot(x)` | plots x (in a hopefully useful way) |
| `hist(x)` | draws a histogram |
| `boxplot(x)` | creates a box plot |
| `barplot(x)` | creates a bar plot |
| `pie(x)` | creates a pie chart |
| `pairs(x)` | draws pairwise scatter plots of all variables |
| `abline(h=...)` | adds a horizontal line |
| `abline(v=..., lty="dashed")` | adds a dashed vertical line |
| `curve(f)` | draws a function |
| `title(new.title)` | sets the title of a plot |
| `text(x, y=NULL, labels)` | adds text to a plot |
| `legend` | adds a legend |
| `graphics.off()` | closes all graphic devices |
| `par(...)` | sets graphical parameters |

graphical parameters:

  • lty: line type
  • col: color, e.g. blue, red, ..., #ffcc00 (hex. RGB)
  • bg: background color
  • cex: font size
  • cex.axis: font size on axes
  • xlog, ylog: use logarithmic axes
  • xlab, ylab: label of x-/y-axis
  • las: orientation of axis labels: parallel(0), horizontal(1), perpendicular(2), vertical(3)
  • mfrow=c(x, y): subdivides area for multiple plots

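Plot a histogram together with the fitted normal density:

hist.with.normal <- function (x, xlabel=deparse(substitute(x)), ...)
{
  h <- hist(x, plot=FALSE, ...) #compute the bins without drawing
  s <- sd(x)
  m <- mean(x)
  ylim <- range(0, h$density, dnorm(0, sd=s)) #leave room for the peak of the normal density
  hist(x, freq=FALSE, ylim=ylim, xlab=xlabel, main="", ...)
  curve(dnorm(x, m, s), add=TRUE) #overlay the normal density with the sample mean and sd
}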

##Clustering

library: cclust

| function | description |
| --- | --- |
| `cclust(coordinates,K,iter.max=100,method="kmeans",dist="euclidean")` | clusters the coordinates into K clusters (k-means by default) |
| `clustIndex(clustering, coordinates, index="calinski")` | calculates the Calinski index for a clustering obtained using cclust |

##Model fitting

| function | description |
| --- | --- |
| `table` | creates a cross table of counts |
| `prop.table` | converts a cross table into proportions |
###LDA, QDA

| function | description |
| --- | --- |
| `lsfit(x,y)` | linear least squares fit y=a*x + b |
| `lda(x, grouping, prior=..., CV=...)` | performs LDA; x: input data, grouping: group assignments, CV: if TRUE, LOOCV is performed and the predictions are returned instead of the model. library: MASS |
| `qda(x, grouping, prior=..., CV=...)` | the same parameters as lda. library: MASS |
| `partimat(x, grouping, method)` | applies LDA/QDA in pairs of dimensions. Plots the decision regions. library: klaR |

Draw the linear regression line onto an existing plot:

abline(lsfit(x,y))
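
The coefficients found by lsfit can also be inspected directly. The sketch below uses made-up points that lie exactly on the line y = 2*x + 1, so the fit should recover that intercept and slope.

```r
x <- 1:10
y <- 2 * x + 1            # points exactly on a line

fit <- lsfit(x, y)
print(fit$coefficients)   # intercept first, then the slope
```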

LDA example:

library(MASS)
data(crabs)
#assumption: the five numeric measurements are the inputs, the species is the class
Crabs <- crabs[, c("FL", "RW", "CL", "CW", "BD")]
Crabs.class <- crabs$sp
lda.model <- lda (x=Crabs, grouping=Crabs.class)
lda.model #we can directly inspect the model
plot(lda.model) #we can plot it
#project the data into the new space
loadings <- as.matrix(Crabs) %*% as.matrix(lda.model$scaling)
ct <- table(Crabs.class, predict(lda.model, Crabs)$class)
sum(diag(prop.table(ct))) #fraction of correctly classified samples
prediction <- predict(lda.model, newdata=...)
prediction$class #the predicted classes
prediction$posterior #the posteriors
prediction$x #the discriminants

###Knn

| function | description |
| --- | --- |
| `knn(train, test, classes, k=1)` | performs a k-nearest-neighbour imputation of the classes for test. Read the note below! package: class |
| `knn.cv(inputs, classes, k=1)` | performs a LOOCV for Knn |

knn returns the imputed values as a factor. Casting a factor directly to numbers returns the internal level codes instead of the values.
Instead, the results must be cast to strings first:

results <- as.double(as.character(knn(train, test, classes)))
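
The pitfall can be reproduced without knn, since it affects any factor whose levels look like numbers. A small sketch with made-up values:

```r
f <- factor(c("10", "20", "10", "5"))

wrong <- as.double(f)                # internal level codes, not the values
right <- as.double(as.character(f))  # the values the labels stand for

print(wrong)
print(right)
```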

###Naive Bayes

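library(e1071)
model <- naiveBayes(Class ~ ., data=..., laplace=...) #laplace: amount of Laplace smoothing
predict(model, newdata) #the predicted classes
predict(model, newdata, type = "raw") #the posterior probabilities of each class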

###(Generalized) Linear Models

| function | description |
| --- | --- |
| `lm(formula)` | fits a linear model |
| `lm.ridge(formula, lambda=...)` | fits a linear model using ridge regression. library: MASS |
| `glm(formula, family=...)` | fits a generalized linear model |
| `glm(formula, family=binomial(link=logit))` | performs a logistic regression |
| `glm(formula, family=poisson(link=log))` | performs a Poisson regression |
| `glm(formula, family=gaussian)` | performs an ordinary linear regression |

An additional data argument (a data frame) is necessary if the variables in the formula are not already defined.

Ridge regression: select prints the values of lambda suggested by the HKB, L-W and GCV criteria:

select(lm.ridge(...))

GLM Example:

#the response is called target here (an illustrative name), since y is already used as a predictor
glm.res <- glm (target~x+y+z, family = binomial(link=logit))
summary(glm.res) #shows more information than just "glm.res"
glm.res$coefficients #access the coefficients
exp(glm.res$coefficients["x"]) #by which factor do the odds change for x=x+1?
exp(glm.res$coefficients) #same for all coefficients
prob <- predict(glm.res, data.frame(x=newx, y=newy, z=newz),type="response") #returns the predicted probabilities; type="link" returns the log-odds
step(glm.res) #tries to simplify the model by stepwise removal of variables, guided by the AIC
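
For a version that runs as is, the following sketch fits a logistic regression on the built-in mtcars data set. The choice of data set and of the variables am (transmission type) and wt (weight) is an assumption made for illustration and not part of the course material.

```r
data(mtcars)

# am is 0/1, so a binomial family with logit link gives a logistic regression
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = logit))

odds.ratio <- exp(coef(fit)["wt"])  # factor by which the odds change per unit of wt

new.car <- data.frame(wt = 3)
p <- predict(fit, new.car, type = "response")     # predicted probability
log.odds <- predict(fit, new.car, type = "link")  # the same on the log-odds scale

print(odds.ratio)
print(p)
```

Applying plogis to the log-odds gives back the probability, which is a quick way to check which scale a prediction is on.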