---
title: "Statistical/Machine Learning Mini-Project VI"
author: "Michael Shoemate"
date: "April 23, 2019"
output: pdf_document
---
# Section 1
```{R echo = F, results = 'hide', message=F, warning=F}
# installs each package if it is not already present, then attaches it
import <- function(packages) {
  log <- lapply(packages, function(package) {
    if (inherits(try(library(package, character.only = TRUE)), 'try-error')) {
      install.packages(package, dependencies = TRUE)
      library(package, character.only = TRUE)
    }
  })
}
import(c("ggplot2", "gridExtra", "reshape2", "dplyr",
         "lmtest", "xtable", "MASS", "lattice", "caret", "pROC", "boot", "broom",
         "ISLR", "leaps", "glmnet", "ElemStatLearn", "glmnetUtils"))
options(xtable.comment = FALSE, xtable.floating=FALSE)
options(warn=-1)

# prints a confusion matrix and summary statistics for a set of predictions
printStatistics <- function(yTrue, yPred, positiveLabel=NULL) {
  confWrapper <- caret::confusionMatrix(yPred, yTrue, positive=positiveLabel)
  cat(' ')
  cat('Confusion Matrix (predicted on y, actual on x) ')
  print(xtable(confWrapper$table, digits=4))
  statistics <- data.frame(
    Sensitivity=confWrapper$byClass['Sensitivity'],
    Specificity=confWrapper$byClass['Specificity'],
    MissRate=1 - confWrapper$byClass['Sensitivity'],
    ErrorRate=1 - confWrapper$overall['Accuracy']
  )
  rownames(statistics) <- NULL
  cat('Summary Statistics ')
  print(xtable(statistics, digits=4))
}

# same statistics, but computed from a caret train() wrapper's saved predictions
printWrapperStatistics <- function(trainWrapper, positiveLabel) {
  labels <- unique(trainWrapper$trainingData$.outcome)
  yTrue <- as.numeric(trainWrapper$trainingData$.outcome == positiveLabel)
  yTrue <- factor(yTrue, levels = c(0, 1), labels = labels)
  yProb <- aggregate(formula(paste(positiveLabel, '~rowIndex')),
                     trainWrapper$pred, mean)[,positiveLabel]
  yPred <- factor(yProb > .5, levels = c(F, T), labels=labels)
  printStatistics(yTrue, yPred)
}
```
## 1.a Data Exploration
Inspection of the data shows that Manufacturer and Group contain the same information, so one of the two may be dropped.
```{R echo=F, message=F, results='asis'}
##### Question 1
# 1.a
cereal <- read.csv('./data/cereal.csv')
xtable(summary(cereal)[,1:4])
cat(' ')
xtable(summary(cereal)[,5:8])
cat(' ')
xtable(summary(cereal)[,9:11])
colsClust <- c("Manufacturer", "Calories", "Protein", "Fat", "Sodium",
"Fiber", "Carbohydrates", "Sugar", "Potassium")
colsQuant <- c("Calories", "Protein", "Fat", "Sodium", "Fiber",
"Carbohydrates", "Sugar", "Potassium")
cerealClust <- cereal[,colsQuant]
cat(" \n")
cat("Correlation matrix among quantitative predictors: ")
xtable(cor(cerealClust))
cat(" \n")
cat("Selected variable pair plots: ")
cat(" \n")
pairs(~Manufacturer+Calories+Fat+Carbohydrates, data=cereal)
```
Fiber and potassium are strongly correlated, but most other variable pairs show little correlation.
## 1.b Standardizing
Variables should typically be standardized before distance-based clustering. The scale of a variable acts as its implicit weight, so standardizing the data gives each variable equal importance. Variables do not need to be standardized for correlation-based clustering, because computing the correlation already standardizes each variable.
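For concreteness, a minimal sketch of this step (not evaluated here) using the quantitative cereal columns defined above:
```{R eval=F}
# standardize each quantitative column to mean 0 and unit variance,
# so every variable contributes equally to the Euclidean distance
cerealScaled <- scale(cereal[, colsQuant])
distStd <- dist(cerealScaled)
```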
## 1.c Method
In this situation, distance-based metrics make more sense. I care less about detecting pattern profiles among cereals irrespective of scale, and more about finding groups with similar nutritional values.
## 1.d Hierarchical Clustering
There is a dissimilarity measure for mixed variable types, the Gower metric via `cluster::daisy`, but I left it commented out because the question explicitly asks for Euclidean distance. Considering this, I simply removed the categorical variables `Manufacturer` and `Group`. Another option would be to apply a one-hot expansion to the qualitative variables (a sketch of this alternative follows the dendrogram below), but this has the downside of increasing the dimensionality. The visualization is a cluster dendrogram based on standardized Euclidean distance, with complete-linkage hierarchical clustering.
```{R echo=F, message=F, results='asis'}
# 1.d
# plot(stats::hclust(cluster::daisy(cerealClust, stand=T), method="complete"))
# plot(stats::hclust(as.dist(1 - abs(cor(t(cerealClust)))), method="complete"))
plot(stats::hclust(dist(scale(cerealClust)), method="complete"), xlab='Observation index')
```
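As a sketch of the one-hot alternative mentioned above (not evaluated; it adds one indicator column per manufacturer):
```{R eval=F}
# expand Manufacturer into indicator columns and append to the standardized quantitative data
onehot <- model.matrix(~ Manufacturer - 1, cereal)
cerealMixed <- cbind(scale(cereal[, colsQuant]), onehot)
plot(stats::hclust(dist(cerealMixed), method="complete"), xlab='Observation index')
```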
In the dendrogram, observation 18 looks like an outlier, and I would break the cereals into 3 or 4 groups. The center group could potentially be split to form the fourth group.
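To actually extract those groupings, `cutree` can cut the same dendrogram at 3 or 4 clusters; a sketch (not evaluated):
```{R eval=F}
hc <- stats::hclust(dist(scale(cerealClust)), method="complete")
groups3 <- cutree(hc, k=3)
groups4 <- cutree(hc, k=4)
table(groups3, groups4)  # shows which of the three groups splits to form the fourth
```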
## 1.e K-Means
K-Means is fit with different values of K. Each value of K is compared via $\mathrm{SS}_{\mathrm{between}} / \mathrm{SS}_{\mathrm{total}}$.
```{R echo=F, message=F, results='asis'}
# 1.e
K_sizes <- 2:4
K_performances <- sapply(K_sizes, function(K) {
  seeds <- cereal[sample(1:nrow(cereal), K), colsQuant]
  fit <- stats::kmeans(cereal[,colsQuant], seeds)
  fit$betweenss / fit$totss
})
plot(K_sizes, K_performances)
```
Among these three values of K, I would choose four clusters: the bend at three clusters is mild, and the between-to-total sum-of-squares ratio still improves at four. Given that this is unsupervised and quite subjective, I could just as well argue for three clusters (i.e., treat the bend as meaningful).
## 1.f Comparison
Again, I could make a case for either clustering scheme. I'm going to recommend K-Means, because the centroids are useful for characterizing the clusters. On the other hand, with the hierarchical clustering, the dendrogram is useful for visualizing how close individual cereals are to their groups.
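As a sketch of that use of the centroids (illustrative only; it refits K-Means with four clusters on standardized data and multiple restarts, which differs slightly from the random-seed fit above):
```{R eval=F}
fit4 <- stats::kmeans(scale(cereal[, colsQuant]), centers=4, nstart=20)
round(fit4$centers, 2)  # one row per cluster, in standardized units
```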
## Question 2.a Decision Tree
A decision tree is fit to the data.
```{R echo=F, message=F, results='asis'}
##### Question 2
analyze <- function(y, x, data, method, hyperparameters=list()) {
  # This performs model training over k folds, as well as grid searches
  do.call(caret::train, c(list(
    formula(paste(y, '~', paste(x, collapse='+'))),
    data=data,
    method=method,
    trControl=caret::trainControl(
      # method = "cv",
      number=2,
      # repeats=1,
      savePredictions = TRUE,
      classProbs = TRUE)
  ), hyperparameters))
}
test <- head(ISLR::Caravan, 1000)
train <- tail(ISLR::Caravan, -1000)  # remaining observations, disjoint from the test set
```
The regions in the discovered decision tree, in left to right order, are:
1. "No" if PPERSAUT is less than six.
2. "No" if PPERSAUT is greater than six, MOSTYPE is greater than or equal to nine and PPLEZIER is less than one.
3. "No" if PPERSAUT is greater than six, MOSTYPE is greater than or equal to nine and PPLEZIER is greater than or equal to one, and PBRAND is greater than or equal to two.
4. "Yes" if PPERSAUT is greater than six, MOSTYPE is greater than or equal to nine and PPLEZIER is greater than or equal to one, and PBRAND is less than two. This is the only case where an observation is predicted "Yes". This can often occur in an unbalanced dataset or a dataset with low variance in a class.
5. "No" if PPERSAUT is greater than six and MOSTYPE is less than nine.
The branches in the tree exist to isolate a hyper-rectangle in the predictor space that is labeled "Yes".
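To make the region structure concrete, the single "Yes" hyper-rectangle can be written as a hand-coded rule. The thresholds below simply mirror the description above and are approximate; the exact cut points come from the fitted tree plotted in the next chunk.
```{R eval=F}
# hypothetical rule for the lone "Yes" region; every other region predicts "No"
inYesRegion <- with(test,
  PPERSAUT >= 6 & MOSTYPE >= 9 & PPLEZIER >= 1 & PBRAND < 2)
table(ifelse(inYesRegion, "Yes", "No"))
```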
```{R echo=F, message=F, results='asis'}
# 2.a
fit <- rpart::rpart(Purchase ~ ., train,
control=rpart::rpart.control(minsplit=10, cp=0.005))
rpart.plot::rpart.plot(fit)
printStatistics(test$Purchase,
predict(fit, test, type="class"), positiveLabel = "Yes")
cat(' \n')
```
## Question 2.b Pruning
```{R echo=F, message=F, results='asis'}
# 2.b
cvGrid <- seq(0., .01, .0005)
cvPerformances <- sapply(cvGrid, function(cp) {
  pruned <- rpart::prune.rpart(fit, cp=cp)
  mean(test$Purchase == predict(pruned, test, type="class"))
})
plot(cvGrid, cvPerformances,
     main="Test accuracy for levels of CP pruning",
     xlab="CP", ylab="Accuracy")
optimalID <- tail(which(cvPerformances == max(cvPerformances)), n=1)
optimalCP <- cvGrid[optimalID]
optimalPerf <- cvPerformances[optimalID]
cat(' \n')
cat(paste0('The optimal cp for pruning is ', optimalCP,
           ' with an accuracy of ', optimalPerf, '.'))
fitPruned <- rpart::prune.rpart(fit, cp=optimalCP)
printStatistics(test$Purchase, predict(fitPruned, test, type="class"),
positiveLabel = "Yes")
```
The pruned and unpruned trees seem to be the same.
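A quick way to verify that claim (sketch, not evaluated) is to compare the two trees' test predictions directly:
```{R eval=F}
# TRUE if pruning at the chosen cp changed no test-set predictions
identical(predict(fit, test, type="class"),
          predict(fitPruned, test, type="class"))
```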
## Question 2.c Bagged Decision Tree
```{R echo=F, message=F, results='asis'}
# 2.c
B <- 1000
# B=1000 bags is too slow on my machine for this model, so use B/5
fit <- analyze('Purchase', c('.'), train, 'treebag',
               hyperparameters=list(nbagg=B/5))
cat(' ')
cat("The top 10 most important variables are shown: ")
cat(' \n')
plot(caret::varImp(fit), top=10)
printStatistics(test$Purchase, predict(fit, test, type="raw"), "Yes")
```
## Question 2.d Random Forest
With the ranger library, `m`, the number of candidate variables considered at each split, defaults to $\sqrt{p}$.
```{R echo=F, message=F, results='asis'}
# 2.d
fit <- analyze('Purchase', c('.'), train, 'ranger',
hyperparameters=list(num.trees=B))
printStatistics(test$Purchase, predict(fit, test, type="raw"), "Yes")
```
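If one wanted to tune `m` instead of relying on the $\sqrt{p}$ default, caret's `ranger` method accepts an explicit grid; a sketch with illustrative values (not evaluated):
```{R eval=F}
p <- ncol(train) - 1
fitRanger <- analyze('Purchase', c('.'), train, 'ranger', hyperparameters=list(
  num.trees=B,
  tuneGrid=expand.grid(mtry=c(floor(sqrt(p)), p %/% 2, p),
                       splitrule="gini", min.node.size=1)))
```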
## Question 2.e Boosted Decision Tree
```{R echo=F, message=F, results='hide'}
# 2.e
fit <- analyze('Purchase', c('.'), train, 'gbm', hyperparameters=list(
  tuneGrid=expand.grid(
    n.trees=B,
    interaction.depth=1,
    shrinkage=.01,
    n.minobsinnode=10
  )))
```
```{R echo=F, message=F, results='asis'}
printStatistics(test$Purchase, predict(fit, test, type="raw"), "Yes")
```
## Question 2.f KNN
```{R echo=F, message=F, results='asis'}
# 2.f
fit <- analyze('Purchase', c('.'), train, 'knn',
               hyperparameters = list(tuneGrid=expand.grid(k=1:30)))
cat(' \n')
cat(paste('The optimal value of K is:', fit$bestTune$k))
cat(' \n')
printStatistics(test$Purchase, predict(fit, test, type="raw"), positiveLabel="Yes")
```
## Question 2.g Lasso Logistic Regression
```{R echo=F, message=F, results='asis'}
# 2.g
xTrain <- model.matrix(Purchase ~ ., train)[,-1]
yTrain <- ifelse(train$Purchase == "Yes", 1, 0)
xTest <- model.matrix(Purchase ~ ., test)[,-1]
grid <- 10^seq(1, -2, length=100)
cvTrainModel <- function(alpha) {
  # test-set accuracy for each candidate lambda, using a 0.2 probability threshold
  accuracies <- sapply(grid, function(lambda) {
    fitted <- glmnet(xTrain, yTrain, family="binomial",
                     alpha=alpha, lambda=lambda, maxit = 500)
    predLabels <- factor(predict(fitted, xTest, type="response") > .2,
                         levels=c(F, T), labels=c("No", "Yes"))
    mean(test$Purchase == predLabels)
  })
  bestLambda <- grid[[first(which(accuracies == max(accuracies)))]]
  cat(paste0('The best value of lambda is ', round(bestLambda, 4), '.'))
  model <- glmnet(xTrain, yTrain, family="binomial",
                  alpha=alpha, lambda=bestLambda, maxit = 500)
  # report only the predictors with nonzero (selected) coefficients
  coefs <- as.matrix(coef(model))
  print(xtable(as.data.frame(coefs[coefs[,1] != 0, , drop=FALSE])))
  predictions <- factor(
    predict(model, xTest, type="response") > .2,
    levels=c(F, T), labels=c("No", "Yes"))
  printStatistics(test$Purchase, predictions, positiveLabel="Yes")
}
cvTrainModel(1)
cvTrainModel(1)
```
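The chunk above picks $\lambda$ by test-set accuracy at a 0.2 threshold. A common alternative, shown here only as a sketch (not evaluated), is glmnet's built-in cross-validation, which selects $\lambda$ on the training data alone:
```{R eval=F}
cvFit <- cv.glmnet(xTrain, yTrain, family="binomial", alpha=1, lambda=grid)
cvFit$lambda.min              # lambda minimizing cross-validated deviance
coef(cvFit, s="lambda.min")   # coefficients at that lambda
```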
## Question 2.h Comparison
The overall best model is the bagged decision tree. Most models struggled with the imbalanced class sizes, but the bagged decision tree maintains 86\% sensitivity and nearly 100\% specificity. Its error rate is around 1\%, well below the roughly 6\% of the other models.
## Question 3.a Custom Implementation
I wrote my own decision tree implementation based on the math in the book and the toy example. The results are similar, but there are some differences deeper into the tree. I spent some considerable time testing and debugging, and am convinced my math is correct. I'm not sure how to account for the differences.
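For reference, the impurity measures used in the implementation below (`get_entropy` for classification nodes, `get_sse` for regression nodes) are
$$D_{\text{class}} = -\sum_{k} \hat{p}_k \log \hat{p}_k, \qquad D_{\text{reg}} = \frac{1}{2n}\sum_{i=1}^{n} (y_i - \bar{y})^2.$$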
```{R echo=F, message=F, results='asis'}
### QUESTION 3/4 Tree Implementation
get_entropy <- function(y) {
  count <- sapply(unique(y), function(level) sum(y == level))
  probability <- count / length(y)
  -sum(probability * log(probability))
}
get_sse <- function(y) sum((y - mean(y))^2) / length(y) / 2

fitTree <- function(X, y, control=list(mincut=5, minsize=10, mindev=.01),
                    rootdev=NA, prefix="") {
  # choice of metric dependent on regression vs. classification tree
  metric <- if (is.factor(y)) get_entropy else get_sse
  # base cases
  fitness <- metric(y)
  if (is.na(rootdev)) rootdev = fitness
  if (!is.finite(fitness) || fitness < control$mindev * rootdev) return("A")
  if (length(y) < control$minsize) return("B")

  find_best_split <- function(x) {
    partitioner <- if (is.factor(x))
      function(split) c(left=list(y[x == split]), right=list(y[x != split]))
    else
      function(split) c(left=list(y[x < split]), right=list(y[x > split]))
    # can split on any level of x
    splits <- sort(unique(x))
    # if quantitative, offset levels to halfway points
    if (!is.factor(x)) splits <- (splits[-length(splits)] + diff(splits) / 2)
    # compute deviance for each potential split
    deviances <- sapply(splits, function(split) {
      partition <- partitioner(split)
      # partitions with members whose cardinality is less than the minsize are invalid
      if (any(sapply(partition, length) < control$minsize)) NaN
      else sum(sapply(partition, function(part) 2 * length(part) * metric(part)))
    })
    optimalIndex <- which.min(deviances)
    if (length(optimalIndex) == 0) return(c(split=NaN, deviance=NaN))
    list(split=splits[[optimalIndex]], deviance=deviances[[optimalIndex]])
  }

  # ignore x if x is pure
  columns <- colnames(X)[sapply(colnames(X),
                                function(column) length(unique(X[,column])) > 1)]
  # end if all predictors are pure
  if (length(columns) == 0) return()
  # find best split for each column
  candidates <- sapply(columns, function(column) find_best_split(X[,column]))
  splitIndex <- which.min(candidates["deviance",])
  # end if there is no minimum index (empty or all NaN candidates)
  if (length(splitIndex) == 0) return()
  splitColumn <- columns[[splitIndex]]
  splitData <- X[[splitColumn]]
  splitValue <- candidates[["split", splitColumn]]
  # indices that belong in left split are dependent on nature of x
  leftIndices <- which(if (is.factor(splitData)) splitData == splitValue
                       else splitData < splitValue)

  # print and recurse
  cat(paste(prefix, "* ", splitColumn, if (is.factor(splitData)) "=="
            else "<", splitValue, length(leftIndices), " \n"))
  fitTree(X[leftIndices,], y[leftIndices],
          control, rootdev=rootdev, prefix=paste0(prefix, " "))
  cat(paste(prefix, "* ", splitColumn, if (is.factor(splitData)) "!="
            else ">", splitValue, nrow(X) - length(leftIndices)), " \n")
  fitTree(X[-leftIndices,], y[-leftIndices],
          control, rootdev=rootdev, prefix=paste0(prefix, " "))
}
```
```{R echo=F, message=F, results='markdown'}
# 3.a
Hitters <- na.omit(ISLR::Hitters[,c("Years", "Hits", "Salary")])
log <- fitTree(Hitters[,c("Years", "Hits")], log(Hitters$Salary),
control=list(mincut=5, minsize=10, mindev=.35))
```
## Question 3.b R Comparison
```{R echo=F, message=F, results='asis'}
# 3.b
tree::tree(log(Salary) ~ Years + Hits, Hitters)
```
## Question 4.a Custom Implementation
```{R echo=F, message=F, results='markdown'}
# 4.a
heart <- read.csv('./data/Heart.csv')
heart$Thal <- as.factor(heart$Thal)
heart$Ca <- as.factor(heart$Ca)
heart$AHD <- as.factor(heart$AHD)
heart <- na.omit(heart[,c("AHD", "Thal", "Ca")])
log <- fitTree(heart[,c("Thal", "Ca")], heart$AHD,
control=list(mincut=5, minsize=10, mindev=.01))
```
## Question 4.b R Comparison
```{R echo=F, message=F, results='asis'}
# 4.b
tree::tree(AHD ~ Thal + Ca, heart, split="deviance")
```
# Section 2
```{r show-code, ref.label=knitr::all_labels(), echo = TRUE, eval=FALSE}
```