#R cheatsheet
Created while studying for the Machine Learning course at FIB at the Universitat Politècnica de Catalunya (BarcelonaTech), this cheatsheet contains the knowledge taught during the first half of this course.
###Syntax
for loops:
for(a in 1:10) {
print(a);
}
function definitions:
myFunction <- function(a, b, c = 0) {
tmp <- a + b;
tmp - c; #the value of the last line is automatically returned
}
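As a small usage sketch of the syntax above: the function `myFunction` is taken from the text, while the input values and the variable names `res.default`, `res.named` and `total` are made up for illustration.

```r
# Function with a default argument, as defined above
myFunction <- function(a, b, c = 0) {
  tmp <- a + b
  tmp - c  # the value of the last evaluated expression is returned
}

res.default <- myFunction(1, 2)         # c falls back to its default 0
res.named   <- myFunction(1, 2, c = 5)  # arguments can also be passed by name

# Accumulate a sum with a for loop
total <- 0
for (a in 1:10) {
  total <- total + a
}
```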
###Array handling
A vector can be created using c(). A sequence of integers can be created using a:b.
test.vector <- c("a", "b", "c", "d");
1:10 #equivalent to c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
An element of this vector can be accessed using square brackets. Elements are counted starting from 1. Negative numbers can be used to exclude elements. Vectors can be used as indices.
test.vector[1] #returns "a"
test.vector[2] #returns "b"
test.vector[-2] #returns "a" "c" "d"
test.vector[c(1,3)] #returns "a" "c"
test.vector[-c(1,3)] #returns "b" "d"
A matrix can be created using matrix(data, nrow, ncol); passing NA as data gives a matrix without values.
Fields are accessed using square brackets, either by index or by column name.
test.matrix <- matrix(NA, 5, 4); #5 rows, 4 columns (one per column name below)
colnames(test.matrix) <- c("col1", "col2", "col3", "result");
test.matrix[c(1,3), c(2,4)] #the submatrix composed of rows 1,3 and columns 2,4
test.matrix[,c("col1", "col2")] #the submatrix composed of all rows and columns 1 and 2
test.matrix[,"result"] #returns only the result column ($ works on data frames, not on matrices)
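The difference between name-based access on a matrix and on a data frame can be sketched as follows. The objects `m` and `df` and their contents are hypothetical and only serve as illustration.

```r
# A matrix is indexed by position or by name inside square brackets
m <- matrix(1:6, nrow = 3, ncol = 2)   # filled column by column
colnames(m) <- c("col1", "result")
m.col <- m[, "result"]                 # second column: 4 5 6

# A data frame additionally supports the $ operator
df <- as.data.frame(m)
df.col <- df$result                    # same values as m.col
```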
##Basic functions
###Data related functions
function | description |
---|---|
a:b | equivalent to seq(a, b) |
seq(a,b,by) | creates a sequence of numbers between a and b using step size by |
rep(x, times) | repeats x the given number of times |
matrix(data, nrow, ncol) | creates a new matrix |
data.frame(...) | creates a data frame using all the parameters as columns |
diag(matrix) | gets/sets the diagonal of this matrix |
diag(vector) | returns a matrix with the given diagonal |
diag(scalar) | returns an identity matrix with the given size |
diag(scalar, nrow) | returns an nrow x nrow matrix with scalar on the diagonal |
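A short sketch of the data related functions from the table; all variable names and values are chosen for illustration only.

```r
s  <- seq(0, 1, by = 0.25)          # 0.00 0.25 0.50 0.75 1.00
r  <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
I3 <- diag(3)                       # 3x3 identity matrix
D  <- diag(c(1, 2, 3))              # matrix with 1, 2, 3 on the diagonal
S  <- diag(2, nrow = 3)             # 3x3 matrix with 2 on the diagonal
d  <- diag(D)                       # extracts the diagonal again: 1 2 3
```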
###Inspection
function | description |
---|---|
dim(x) | returns the size in all dimensions |
nrow(x) | returns the number of rows |
ncol(x) | returns the number of columns |
names(x) | gets/sets the names of an object |
colnames(x) | gets/sets the column names of a matrix or data frame |
rownames(x) | gets/sets the row names of a matrix or data frame |
summary(x) | prints a summary |
###Type handling
function | description |
---|---|
typeof(x) | gives information about the type of a variable |
as.integer(x) | casts to integer |
as.double(x) | casts to double |
as.factor(x) | casts to factor |
as.ordered(x) | casts to ordered factor |
as.character(x) | casts to string |
as.data.frame(x) | casts to a data frame |
is.*(x) | tests if x is of the specified type |
levels(factor) | gets/sets the available levels of a factor |
droplevels(factor) | drops unused levels of a factor |
cut(x, breaks, labels = NULL) | creates a factor by cutting x into slices specified by breaks |
unclass(x) | removes the class attribute; for a factor this returns the integer codes, with the levels kept as an attribute |
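The following sketch shows how factors behave under the casts listed above. The data values are invented for illustration.

```r
f <- factor(c("10", "20", "10", "30"))
codes  <- as.integer(f)                # internal level codes 1 2 1 3, not the values
values <- as.double(as.character(f))   # the actual numbers 10 20 10 30

sub <- droplevels(f[f != "30"])        # the unused level "30" is removed

ages   <- c(5, 17, 30, 70)
groups <- cut(ages, breaks = c(0, 18, 65, Inf),
              labels = c("child", "adult", "senior"))
```

Note that cut uses right-closed intervals by default, so a value of exactly 18 would still fall into the first slice.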
###File handling
function | description |
---|---|
setwd(path) | sets the working directory |
getwd() | gets the working directory |
list.files(path = ".") | lists contents of a directory |
read.csv(filename, ...) | reads a CSV file |
save(data, file=...) | writes data into file (binary) |
load(file) | loads a variable saved using save |
###String operations
function | description |
---|---|
paste | concatenates strings |
paste0 | concatenates strings without separator |
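A brief illustration of both functions; the `sep` and `collapse` arguments are standard parameters of `paste` that the table does not mention.

```r
p1 <- paste("a", "b")                         # "a b" (default separator is a space)
p2 <- paste("a", "b", sep = "-")              # "a-b"
p3 <- paste0("x", 1:3)                        # "x1" "x2" "x3" (vectorized)
p4 <- paste(c("a", "b", "c"), collapse = ", ") # "a, b, c" (joins a vector)
```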
###Calculations
function | description |
---|---|
apply(x, margin, fun) | applies function fun to every column or row of x |
sum(x) | calculates the sum of a vector/matrix |
mean(x) | calculates the mean of a vector/matrix |
median(x) | calculates medians |
quantile(x, probs = seq(0,1,0.25)) | returns the quantiles of a variable |
range(x) | returns the minimum and maximum of x |
cor(x) | calculates the correlation matrix of the columns of a data frame |
choose(n, k) | calculates binomial coefficients |
ginv(x) | calculates the Moore-Penrose generalized inverse. library: MASS |
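A minimal sketch of some of these functions on a small matrix; the values are chosen for illustration.

```r
m <- matrix(1:6, nrow = 2)        # 2 rows, 3 columns: rows are (1,3,5) and (2,4,6)
row.sums  <- apply(m, 1, sum)     # margin 1 = rows:    9 12
col.means <- apply(m, 2, mean)    # margin 2 = columns: 1.5 3.5 5.5
q <- quantile(1:9)                # named vector: 0%, 25%, 50%, 75%, 100%
r <- range(c(4, -1, 7))           # -1 7
n.pairs <- choose(5, 2)           # 10
```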
###Other useful functions
function | description |
---|---|
which(x) | returns the indices of the TRUE values in a boolean vector |
replicate(n, expr) | evaluates the expression expr n times and returns the results combined into a vector |
rm(x) | deletes a variable |
ls() | enumerates all defined variables |
Remove all variables: rm(list = ls())
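A short sketch of which and replicate; the data and the simulated experiment are made up for illustration.

```r
x <- c(3, 8, 1, 9)
idx <- which(x > 5)      # indices of the TRUE values: 2 4
big <- x[x > 5]          # logical vectors can also index directly: 8 9

# replicate re-evaluates the expression each time, so every run draws new numbers
set.seed(1)
means <- replicate(100, mean(runif(10)))
```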
##Handling missing data
function | description |
---|---|
na.omit(x) | removes rows with NAs from a data frame |
addNA(x) | for factors: adds NA as a new level |
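A sketch of both functions on a small, invented data frame. `is.na` and the `na.rm` argument are not part of the table above, but are standard base R tools for the same task.

```r
df <- data.frame(x = c(1, NA, 3), g = factor(c("a", "b", NA)))

clean <- na.omit(df)             # keeps only rows without any NA (here: the first row)
g.na  <- addNA(df$g)             # NA becomes a level of its own
n.missing <- sum(is.na(df$x))    # number of missing values in x
m <- mean(df$x, na.rm = TRUE)    # ignores NAs when computing the mean
```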
##Statistics
###Random Numbers
function | description |
---|---|
set.seed(x) | sets the random seed |
runif(n, min = 0, max = 1) | samples the uniform distribution |
rbinom(n, size, prob) | samples the binomial distribution |
rnorm(N, mean, sd) | samples the normal distribution |
rmvnorm(N, mu, sigma) | samples a multivariate normal distribution. library: mvtnorm |
rpois(n, lambda) | samples the Poisson distribution |
sample(vector, N) | draws N samples from vector |
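The effect of set.seed can be sketched as follows: resetting the seed reproduces the same random draws. The seed value 42 is arbitrary.

```r
set.seed(42)
a <- rnorm(5, mean = 0, sd = 1)
set.seed(42)
b <- rnorm(5, mean = 0, sd = 1)
same <- identical(a, b)      # same seed, same numbers

s <- sample(1:10, 3)         # draws 3 elements without replacement
u <- runif(4, min = 2, max = 3)
```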
###Tests
function | description |
---|---|
chisq.test(x,y) | performs a Chi-square test |
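A minimal sketch of a test of independence between two categorical variables. The data are constructed by hand so that the two variables are clearly dependent.

```r
# 100 observations: "a" mostly goes with "u", "b" mostly with "v"
x <- rep(c("a", "b"), each = 50)
y <- c(rep("u", 40), rep("v", 10), rep("u", 10), rep("v", 40))

res <- chisq.test(x, y)
res$statistic   # the test statistic
res$parameter   # degrees of freedom
res$p.value     # a small p-value speaks against independence
```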
##Plotting
function | description |
---|---|
plot(x) | plots x (in a hopefully useful way) |
hist(x) | draws a histogram |
boxplot(x) | creates a box plot |
barplot(x) | creates a bar plot |
pie(x) | creates a pie chart |
pairs(x) | draws pairwise scatter-plots of all variables |
abline(h=...) | creates a horizontal line |
abline(v=..., lty="dashed") | creates a dashed vertical line |
curve(f) | draws a function |
title(new.title) | sets the title of a plot |
text(x, y=NULL, labels) | adds text to a plot |
legend | adds a legend |
graphics.off() | reset/close all graphic devices |
par(...) | sets graphical parameters |
graphical parameters:
- lty: line type
- col: color (blue, red, ..., #ffcc00 (hex. RGB))
- bg: background color
- cex: font size
- cex.axis: font size on axes
- xlog, ylog: use logarithmic axes
- xlab, ylab: label of x-/y-axis
- las: orientation of axis labels: parallel (0), horizontal (1), perpendicular (2), vertical (3)
- mfrow=c(x, y): subdivides area for multiple plots
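A sketch of how some of these parameters can be combined. The plotted data are random and purely illustrative; saving and restoring the old settings via the return value of `par` is a common habit, not something the notes above require.

```r
old.par <- par(mfrow = c(1, 2), las = 1)  # two plots side by side; par returns the old values

plot(1:10, col = "blue", xlab = "index", ylab = "value")
abline(h = 5, lty = "dashed")             # dashed horizontal line

set.seed(1)
hist(rnorm(100), col = "#ffcc00", main = "")
title("100 normal samples")

par(old.par)                              # restore the previous settings
```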
Plot histogram together with normal estimation:
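hist.with.normal <- function (x, xlabel=deparse(substitute(x)), ...)
{
h <- hist(x, plot=FALSE, ...) #compute the histogram without drawing it
s <- sd(x)
m <- mean(x)
ylim <- range(0, h$density, dnorm(0, sd=s)) #leave room for the peak of the normal density
hist(x, freq=FALSE, ylim=ylim, xlab=xlabel, main="", ...)
curve(dnorm(x, m, s), add=TRUE) #overlay the estimated normal density
}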
##Clustering
library: cclust
function | description |
---|---|
cclust(coordinates,K,iter.max=100,method="kmeans",dist="euclidean") | performs a k-means clustering |
clustIndex(clustering, coordinates, index="calinski") | calculates Calinski index for a clustering obtained using cclust |
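To keep the example free of extra packages, the sketch below uses base R's `kmeans` as a stand-in for `cclust`; it is not the cclust API, and the two simulated point clouds are invented for illustration.

```r
set.seed(1)
# Two well-separated groups of 50 points each in 2 dimensions
coords <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
                matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

km <- kmeans(coords, centers = 2, iter.max = 100, nstart = 5)
km$cluster    # cluster assignment of every point
km$centers    # coordinates of the cluster centers
```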
##Model fitting
function | description |
---|---|
table | creates a cross table |
prop.table | creates a probability cross table |
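A small sketch of a confusion table built from invented true and predicted labels:

```r
truth <- c("a", "a", "b", "b", "b")
pred  <- c("a", "b", "b", "b", "a")

ct <- table(truth, pred)            # counts per (truth, pred) combination
p.all <- prop.table(ct)             # proportions of the grand total
p.row <- prop.table(ct, 1)          # proportions within each row
acc <- sum(diag(prop.table(ct)))    # share of matching labels: 3 of 5
```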
###LDA, QDA
function | description |
---|---|
lsfit(x,y) | linear least squares fit y=a*x + b |
lda(x, grouping, prior=..., CV=...) | performs LDA; x: input data, grouping: group assignments, CV: if TRUE, LOOCV is performed and the predictions are returned instead of the model. library: MASS |
qda(x, grouping, prior=..., CV=...) | performs QDA; same parameters as lda. library: MASS |
partimat(x, grouping, method) | applies LDA/QDA in pairs of dimensions. Plots the decision regions. library: klaR |
Draw the linear regression line:
abline(lsfit(x,y))
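A self-contained sketch of the same idea on noise-free, made-up data, where the fitted coefficients can be checked by hand:

```r
x <- 1:10
y <- 2 * x + 1            # exact line with slope 2 and intercept 1

fit <- lsfit(x, y)
fit$coefficients          # named vector: Intercept and X

plot(x, y)
abline(fit)               # draws the fitted line through the points
```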
LDA example:
library(MASS) #provides lda and the crabs data set
data(crabs)
#assumption: Crabs holds the five numeric measurements, Crabs.class the species
Crabs <- crabs[, 4:8]
Crabs.class <- crabs$sp
lda.model <- lda(x=Crabs, grouping=Crabs.class)
lda.model #we can directly inspect the model
plot(lda.model) #we can plot it
#project the data into the new space
loadings <- as.matrix(Crabs) %*% as.matrix(lda.model$scaling)
ct <- table(Crabs.class, predict(lda.model, Crabs)$class)
sum(diag(prop.table(ct))) #proportion of correctly classified samples
prediction <- predict(lda.model, newdata=...)
prediction$class #the predicted classes
prediction$posterior #the posteriors
prediction$x #the discriminants
###Knn
function | description |
---|---|
knn(train, test, cl, k=1) | performs a k-nearest-neighbor classification of test, based on the training inputs train and their classes cl. Read the note below! library: class |
knn.cv(train, cl, k=1) | performs a LOOCV for kNN |
knn returns the predicted classes as a factor. Casting it directly to numbers yields the internal level codes instead of the values.
Instead, the results must be cast to strings first:
results <- as.double(as.character(knn(train, test, cl)))
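The casting pitfall can be reproduced with a tiny example. The one-dimensional training data and the class labels 10 and 20 are invented; the sketch assumes the class package is installed (it ships with standard R distributions).

```r
library(class)

train <- matrix(c(0, 0.1, 0.2, 5, 5.1, 5.2), ncol = 1)  # six training points
cl    <- factor(c(10, 10, 10, 20, 20, 20))              # numeric class labels
test  <- matrix(c(0.05, 4.9), ncol = 1)                 # two points to classify

pred  <- knn(train, test, cl, k = 3)
wrong <- as.double(pred)                 # level codes 1 2
right <- as.double(as.character(pred))   # actual labels 10 20
```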
###Naive Bayes
library(e1071)
model <- naiveBayes(Class ~ ., data=..., laplace=...)
predict(model, newdata)
predict(model, newdata, type = "raw")
###(Generalized) Linear Models
function | description |
---|---|
lm(formula) | fits a linear model |
lm.ridge(formula, lambda=...) | fits a linear model using ridge regression. library: MASS |
glm(formula, family=...) | fits a generalized linear model |
glm(formula, family=binomial(link=logit)) | performs a logistic regression |
glm(formula, family=poisson(link=log)) | performs a Poisson regression |
glm(formula, family=gaussian) | fits an ordinary linear model (equivalent to lm) |
An additional data argument is necessary if the variables in the formula are not already defined.
Ridge regression (select reports suggested values for lambda):
select(lm.ridge(...))
GLM Example:
glm.res <- glm(resp ~ x + y + z, family = binomial(link=logit)) #resp: the binary response; it must not also appear as a predictor
summary(glm.res) #shows more information than just "glm.res"
glm.res$coefficients #access the coefficients
exp(glm.res$coefficients["x"]) #factor by which the odds change when x increases by 1
exp(glm.res$coefficients) #same for all coefficients
ord <- predict(glm.res, data.frame(x=newx, y=newy, z=newz), type="response") #returns the predicted probabilities; type="link" returns the log-odds
step(glm.res) #tries to simplify the model by stepwise removal of variables (based on AIC)
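A runnable sketch of a logistic regression on simulated data. The true coefficients 0.5 and 2, the sample size and all variable names are assumptions made for this illustration only.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
p <- 1 / (1 + exp(-(0.5 + 2 * x)))      # true success probabilities
y <- rbinom(n, size = 1, prob = p)      # binary response
d <- data.frame(x = x, y = y)

fit <- glm(y ~ x, family = binomial(link = logit), data = d)
coef(fit)                               # estimates of intercept and slope

new.d   <- data.frame(x = c(-1, 1))
prob    <- predict(fit, new.d, type = "response")  # probabilities in (0, 1)
logodds <- predict(fit, new.d, type = "link")      # log-odds (linear predictor)
```

The two prediction types are linked by the logistic function: prob equals 1 / (1 + exp(-logodds)).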