---
title: "Statistical/Machine Learning Mini-Project VI"
author: "Michael Shoemate"
date: "April 23, 2019"
output: pdf_document
---
# Section 1
```{R echo = F, results = 'hide', message=F, warning=F}
# installs each package if it is not already present, then attaches it
import <- function(packages) {
  log <- lapply(packages, function(package) {
    if (inherits(try(library(package, character.only = TRUE)), 'try-error')) {
      install.packages(package, dependencies = TRUE)
      library(package, character.only = TRUE)
    }
  })
}
import(c("ggplot2", "gridExtra", "reshape2", "dplyr",
         "lmtest", "xtable", "MASS", "lattice", "caret", "pROC", "boot", "broom",
         "ISLR", "leaps", "glmnet", "ElemStatLearn", "glmnetUtils"))
options(xtable.comment = FALSE, xtable.floating=FALSE)
options(warn=-1)

# prints a confusion matrix and summary statistics for a set of predictions
printStatistics <- function(yTrue, yPred, positiveLabel=NULL) {
  confWrapper <- caret::confusionMatrix(yPred, yTrue, positive=positiveLabel)
  cat(' ')
  cat('Confusion Matrix (predicted on y, actual on x) ')
  print(xtable(confWrapper$table, digits=4))
  statistics <- data.frame(
    Sensitivity=confWrapper$byClass['Sensitivity'],
    Specificity=confWrapper$byClass['Specificity'],
    MissRate=1 - confWrapper$byClass['Sensitivity'],
    ErrorRate=1 - confWrapper$overall['Accuracy']
  )
  rownames(statistics) <- NULL
  cat('Summary Statistics ')
  print(xtable(statistics, digits=4))
}

# same statistics, but computed from a caret train() wrapper's saved predictions
printWrapperStatistics <- function(trainWrapper, positiveLabel) {
  labels <- unique(trainWrapper$trainingData$.outcome)
  yTrue <- as.numeric(trainWrapper$trainingData$.outcome == positiveLabel)
  yTrue <- factor(yTrue, levels = c(0, 1), labels = labels)
  yProb <- aggregate(formula(paste(positiveLabel, '~rowIndex')),
                     trainWrapper$pred, mean)[,positiveLabel]
  yPred <- factor(yProb > .5, levels = c(F, T), labels=labels)
  printStatistics(yTrue, yPred)
}
```
## 1.a Data Exploration
Inspection of the data shows that Manufacturer and Group contain the same information, so one of the two may be dropped.
```{R echo=F, message=F, results='asis'}
##### Question 1
# 1.a
cereal <- read.csv('./data/cereal.csv')
xtable(summary(cereal)[,1:4])
cat(' ')
xtable(summary(cereal)[,5:8])
cat(' ')
xtable(summary(cereal)[,9:11])
colsClust <- c("Manufacturer", "Calories", "Protein", "Fat", "Sodium",
"Fiber", "Carbohydrates", "Sugar", "Potassium")
colsQuant <- c("Calories", "Protein", "Fat", "Sodium", "Fiber",
"Carbohydrates", "Sugar", "Potassium")
cerealClust <- cereal[,colsQuant]
cat(" \n")
cat("Correlation matrix among quantitative predictors: ")
xtable(cor(cerealClust))
cat(" \n")
cat("Selected variable pair plots: ")
cat(" \n")
pairs(~Manufacturer+Calories+Fat+Carbohydrates, data=cereal)
```
Fiber and potassium are strongly correlated, but most other variable pairs show little correlation.
## 1.b Standardizing
Variables should typically be standardized before distance-based clustering. The scale of a variable acts as its implicit weight, so standardizing the data gives each variable equal importance. Variables do not need to be standardized for correlation-based clustering, because computing the correlation already standardizes each variable.
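For concreteness, a minimal sketch of this step (not evaluated here) using the quantitative cereal columns defined above:
```{R eval=F}
# standardize each quantitative column to mean 0 and unit variance,
# so every variable contributes equally to the Euclidean distance
cerealScaled <- scale(cereal[, colsQuant])
distStd <- dist(cerealScaled)
```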
## 1.c Method
In this situation, distance-based metrics make more sense. I care less about detecting pattern profiles among cereals irrespective of scale, and more about finding groups with similar nutritional values.
## 1.d Hierarchical Clustering
There is a dissimilarity measure for mixed variable types, the Gower metric via `cluster::daisy`, but I left it commented out because the question explicitly asks for Euclidean distance. Considering this, I simply removed the categorical variables `Manufacturer` and `Group`. Another option would be to apply a one-hot expansion to the qualitative variables (a sketch of this alternative follows the dendrogram below), but this has the downside of increasing the dimensionality. The visualization is a cluster dendrogram based on standardized Euclidean distance, with complete-linkage hierarchical clustering.
```{R echo=F, message=F, results='asis'}
# 1.d
# plot(stats::hclust(cluster::daisy(cerealClust, stand=T), method="complete"))
# plot(stats::hclust(as.dist(1 - abs(cor(t(cerealClust)))), method="complete"))
plot(stats::hclust(dist(scale(cerealClust)), method="complete"), xlab='Observation index')
```
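As a sketch of the one-hot alternative mentioned above (not evaluated; it adds one indicator column per manufacturer):
```{R eval=F}
# expand Manufacturer into indicator columns and append to the standardized quantitative data
onehot <- model.matrix(~ Manufacturer - 1, cereal)
cerealMixed <- cbind(scale(cereal[, colsQuant]), onehot)
plot(stats::hclust(dist(cerealMixed), method="complete"), xlab='Observation index')
```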
In the dendrogram, observation 18 looks like an outlier, and I would break the cereals into 3 or 4 groups. The center group could potentially be split to form the fourth group.
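To actually extract those groupings, `cutree` can cut the same dendrogram at 3 or 4 clusters; a sketch (not evaluated):
```{R eval=F}
hc <- stats::hclust(dist(scale(cerealClust)), method="complete")
groups3 <- cutree(hc, k=3)
groups4 <- cutree(hc, k=4)
table(groups3, groups4)  # shows which of the three groups splits to form the fourth
```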
## 1.e K-Means
K-Means is fit with different values of K. Each value of K is compared via $\mathrm{SS}_{\mathrm{between}} / \mathrm{SS}_{\mathrm{total}}$.
```{R echo=F, message=F, results='asis'}
# 1.e
K_sizes <- 2:4
K_performances <- sapply(K_sizes, function(K) {
  seeds <- cereal[sample(1:nrow(cereal), K), colsQuant]
  fit <- stats::kmeans(cereal[,colsQuant], seeds)
  fit$betweenss / fit$totss
})
plot(K_sizes, K_performances)
```
Among these three values of K, I would choose four clusters: the bend at three clusters is mild, and the between-to-total sum-of-squares ratio still improves at four. Given that this is unsupervised and quite subjective, I could just as well argue for three clusters (i.e., treat the bend as meaningful).
## 1.f Comparison
Again, I could make a case for either clustering scheme. I'm going to recommend K-Means, because the centroids are useful for characterizing the clusters. On the other hand, with the hierarchical clustering, the dendrogram is useful for visualizing how close individual cereals are to their groups.
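As a sketch of that use of the centroids (illustrative only; it refits K-Means with four clusters on standardized data and multiple restarts, which differs slightly from the random-seed fit above):
```{R eval=F}
fit4 <- stats::kmeans(scale(cereal[, colsQuant]), centers=4, nstart=20)
round(fit4$centers, 2)  # one row per cluster, in standardized units
```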
## Question 2.a Decision Tree
A decision tree is fit to the data.
```{R echo=F, message=F, results='asis'}
##### Question 2
analyze <- function(y, x, data, method, hyperparameters=list()) {
  # This performs model training over k folds, as well as grid searches
  do.call(caret::train, c(list(
    formula(paste(y, '~', paste(x, collapse='+'))),
    data=data,
    method=method,
    trControl=caret::trainControl(
      # method = "cv",
      number=2,
      # repeats=1,
      savePredictions = TRUE,
      classProbs = TRUE)
  ), hyperparameters))
}
test <- head(ISLR::Caravan, 1000)
train <- tail(ISLR::Caravan, -1000)  # remaining observations, disjoint from the test set
```
The regions in the discovered decision tree, in left to right order, are:
1. "No" if PPERSAUT is less than six.
2. "No" if PPERSAUT is greater than six, MOSTYPE is greater than or equal to nine and PPLEZIER is less than one.
3. "No" if PPERSAUT is greater than six, MOSTYPE is greater than or equal to nine and PPLEZIER is greater than or equal to one, and PBRAND is greater than or equal to two.
4. "Yes" if PPERSAUT is greater than six, MOSTYPE is greater than or equal to nine and PPLEZIER is greater than or equal to one, and PBRAND is less than two. This is the only case where an observation is predicted "Yes". This can often occur in an unbalanced dataset or a dataset with low variance in a class.
5. "No" if PPERSAUT is greater than six and MOSTYPE is less than nine.
The branches in the tree exist to isolate a hyper-rectangle in the predictor space that is labeled "Yes".
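To make the region structure concrete, the single "Yes" hyper-rectangle can be written as a hand-coded rule. The thresholds below simply mirror the description above and are approximate; the exact cut points come from the fitted tree plotted in the next chunk.
```{R eval=F}
# hypothetical rule for the lone "Yes" region; every other region predicts "No"
inYesRegion <- with(test,
  PPERSAUT >= 6 & MOSTYPE >= 9 & PPLEZIER >= 1 & PBRAND < 2)
table(ifelse(inYesRegion, "Yes", "No"))
```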
```{R echo=F, message=F, results='asis'}
# 2.a
fit <- rpart::rpart(Purchase ~ ., train,
control=rpart::rpart.control(minsplit=10, cp=0.005))
rpart.plot::rpart.plot(fit)
printStatistics(test$Purchase,
predict(fit, test, type="class"), positiveLabel = "Yes")
cat(' \n')
```
## Question 2.b Pruning
```{R echo=F, message=F, results='asis'}
# 2.b
cvGrid <- seq(0., .01, .0005)
cvPerformances <- sapply(cvGrid, function(cp) {
  pruned <- rpart::prune.rpart(fit, cp=cp)
  mean(test$Purchase == predict(pruned, test, type="class"))
})
plot(cvGrid, cvPerformances,
     main="Test accuracy for levels of CP pruning",
     xlab="CP", ylab="Accuracy")
optimalID <- tail(which(cvPerformances == max(cvPerformances)), n=1)
optimalCP <- cvGrid[optimalID]
optimalPerf <- cvPerformances[optimalID]
cat(' \n')
cat(paste0('The optimal cp for pruning is ', optimalCP,
           ' with an accuracy of ', optimalPerf, '.'))
fitPruned <- rpart::prune.rpart(fit, cp=optimalCP)
printStatistics(test$Purchase, predict(fitPruned, test, type="class"),
positiveLabel = "Yes")
```
The pruned and unpruned trees seem to be the same.
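A quick way to verify that claim (sketch, not evaluated) is to compare the two trees' test predictions directly:
```{R eval=F}
# TRUE if pruning at the chosen cp changed no test-set predictions
identical(predict(fit, test, type="class"),
          predict(fitPruned, test, type="class"))
```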
## Question 2.c Bagged Decision Tree
```{R echo=F, message=F, results='asis'}
# 2.c
B <- 1000
# B=1000 bags is too slow on my machine for this model, so use B/5
fit <- analyze('Purchase', c('.'), train, 'treebag',
               hyperparameters=list(nbagg=B/5))
cat(' ')
cat("The top 10 most important variables are shown: ")
cat(' \n')
plot(caret::varImp(fit), top=10)
printStatistics(test$Purchase, predict(fit, test, type="raw"), "Yes")
```
## Question 2.d Random Forest
With the ranger library, `m`, the number of candidate variables considered at each split, defaults to $\sqrt{p}$.
```{R echo=F, message=F, results='asis'}
# 2.d
fit <- analyze('Purchase', c('.'), train, 'ranger',
hyperparameters=list(num.trees=B))
printStatistics(test$Purchase, predict(fit, test, type="raw"), "Yes")
```
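If one wanted to tune `m` instead of relying on the $\sqrt{p}$ default, caret's `ranger` method accepts an explicit grid; a sketch with illustrative values (not evaluated):
```{R eval=F}
p <- ncol(train) - 1
fitRanger <- analyze('Purchase', c('.'), train, 'ranger', hyperparameters=list(
  num.trees=B,
  tuneGrid=expand.grid(mtry=c(floor(sqrt(p)), p %/% 2, p),
                       splitrule="gini", min.node.size=1)))
```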
## Question 2.e Boosted Decision Tree
```{R echo=F, message=F, results='hide'}
# 2.e
fit <- analyze('Purchase', c('.'), train, 'gbm', hyperparameters=list(
  tuneGrid=expand.grid(
    n.trees=B,
    interaction.depth=1,
    shrinkage=.01,
    n.minobsinnode=10
  )))
```
```{R echo=F, message=F, results='asis'}
printStatistics(test$Purchase, predict(fit, test, type="raw"), "Yes")
```
## Question 2.f KNN
```{R echo=F, message=F, results='asis'}
# 2.f
fit <- analyze('Purchase', c('.'), train, 'knn',
               hyperparameters = list(tuneGrid=expand.grid(k=1:30)))
cat(' \n')
cat(paste('The optimal value of K is:', fit$bestTune$k))
cat(' \n')
printStatistics(test$Purchase, predict(fit, test, type="raw"), positiveLabel="Yes")
```
## Question 2.g Lasso Logistic Regression
```{R echo=F, message=F, results='asis'}
# 2.g
xTrain <- model.matrix(Purchase ~ ., train)[,-1]
yTrain <- ifelse(train$Purchase == "Yes", 1, 0)
xTest <- model.matrix(Purchase ~ ., test)[,-1]
grid <- 10^seq(1, -2, length=100)
cvTrainModel <- function(alpha) {
  # test-set accuracy for each candidate lambda, using a 0.2 probability threshold
  accuracies <- sapply(grid, function(lambda) {
    fitted <- glmnet(xTrain, yTrain, family="binomial",
                     alpha=alpha, lambda=lambda, maxit = 500)
    predLabels <- factor(predict(fitted, xTest, type="response") > .2,
                         levels=c(F, T), labels=c("No", "Yes"))
    mean(test$Purchase == predLabels)
  })
  bestLambda <- grid[[first(which(accuracies == max(accuracies)))]]
  cat(paste0('The best value of lambda is ', round(bestLambda, 4), '.'))
  model <- glmnet(xTrain, yTrain, family="binomial",
                  alpha=alpha, lambda=bestLambda, maxit = 500)
  # report only the predictors with nonzero (selected) coefficients
  coefs <- as.matrix(coef(model))
  print(xtable(as.data.frame(coefs[coefs[,1] != 0, , drop=FALSE])))
  predictions <- factor(
    predict(model, xTest, type="response") > .2,
    levels=c(F, T), labels=c("No", "Yes"))
  printStatistics(test$Purchase, predictions, positiveLabel="Yes")
}
cvTrainModel(1)
cvTrainModel(1)
```
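The chunk above picks $\lambda$ by test-set accuracy at a 0.2 threshold. A common alternative, shown here only as a sketch (not evaluated), is glmnet's built-in cross-validation, which selects $\lambda$ on the training data alone:
```{R eval=F}
cvFit <- cv.glmnet(xTrain, yTrain, family="binomial", alpha=1, lambda=grid)
cvFit$lambda.min              # lambda minimizing cross-validated deviance
coef(cvFit, s="lambda.min")   # coefficients at that lambda
```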
## Question 2.h Comparison
The overall best model is the bagged decision tree. Most models struggled with the imbalanced class sizes, but the bagged decision tree maintains 86\% sensitivity and nearly 100\% specificity. Its error rate is around 1\%, well below the roughly 6\% of the other models.
## Question 3.a Custom Implementation
I wrote my own decision tree implementation based on the math in the book and the toy example. The results are similar, but there are some differences deeper into the tree. I spent some considerable time testing and debugging, and am convinced my math is correct. I'm not sure how to account for the differences.
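For reference, the impurity measures used in the implementation below (`get_entropy` for classification nodes, `get_sse` for regression nodes) are
$$D_{\text{class}} = -\sum_{k} \hat{p}_k \log \hat{p}_k, \qquad D_{\text{reg}} = \frac{1}{2n}\sum_{i=1}^{n} (y_i - \bar{y})^2.$$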
```{R echo=F, message=F, results='asis'}
### QUESTION 3/4 Tree Implementation
get_entropy <- function(y) {
  count <- sapply(unique(y), function(level) sum(y == level))
  probability <- count / length(y)
  -sum(probability * log(probability))
}
get_sse <- function(y) sum((y - mean(y))^2) / length(y) / 2

fitTree <- function(X, y, control=list(mincut=5, minsize=10, mindev=.01),
                    rootdev=NA, prefix="") {
  # choice of metric dependent on regression vs. classification tree
  metric <- if (is.factor(y)) get_entropy else get_sse
  # base cases
  fitness <- metric(y)
  if (is.na(rootdev)) rootdev = fitness
  if (!is.finite(fitness) || fitness < control$mindev * rootdev) return("A")
  if (length(y) < control$minsize) return("B")

  find_best_split <- function(x) {
    partitioner <- if (is.factor(x))
      function(split) c(left=list(y[x == split]), right=list(y[x != split]))
    else
      function(split) c(left=list(y[x < split]), right=list(y[x > split]))
    # can split on any level of x
    splits <- sort(unique(x))
    # if quantitative, offset levels to halfway points
    if (!is.factor(x)) splits <- (splits[-length(splits)] + diff(splits) / 2)
    # compute deviance for each potential split
    deviances <- sapply(splits, function(split) {
      partition <- partitioner(split)
      # partitions with members whose cardinality is less than the minsize are invalid
      if (any(sapply(partition, length) < control$minsize)) NaN
      else sum(sapply(partition, function(part) 2 * length(part) * metric(part)))
    })
    optimalIndex <- which.min(deviances)
    if (length(optimalIndex) == 0) return(c(split=NaN, deviance=NaN))
    list(split=splits[[optimalIndex]], deviance=deviances[[optimalIndex]])
  }

  # ignore x if x is pure
  columns <- colnames(X)[sapply(colnames(X),
                                function(column) length(unique(X[,column])) > 1)]
  # end if all predictors are pure
  if (length(columns) == 0) return()
  # find best split for each column
  candidates <- sapply(columns, function(column) find_best_split(X[,column]))
  splitIndex <- which.min(candidates["deviance",])
  # end if there is no minimum index (empty or all NaN candidates)
  if (length(splitIndex) == 0) return()
  splitColumn <- columns[[splitIndex]]
  splitData <- X[[splitColumn]]
  splitValue <- candidates[["split", splitColumn]]
  # indices that belong in left split are dependent on nature of x
  leftIndices <- which(if (is.factor(splitData)) splitData == splitValue
                       else splitData < splitValue)

  # print and recurse
  cat(paste(prefix, "* ", splitColumn, if (is.factor(splitData)) "=="
            else "<", splitValue, length(leftIndices), " \n"))
  fitTree(X[leftIndices,], y[leftIndices],
          control, rootdev=rootdev, prefix=paste0(prefix, " "))
  cat(paste(prefix, "* ", splitColumn, if (is.factor(splitData)) "!="
            else ">", splitValue, nrow(X) - length(leftIndices)), " \n")
  fitTree(X[-leftIndices,], y[-leftIndices],
          control, rootdev=rootdev, prefix=paste0(prefix, " "))
}
```
```{R echo=F, message=F, results='markdown'}
# 3.a
Hitters <- na.omit(ISLR::Hitters[,c("Years", "Hits", "Salary")])
log <- fitTree(Hitters[,c("Years", "Hits")], log(Hitters$Salary),
control=list(mincut=5, minsize=10, mindev=.35))
```
## Question 3.b R Comparison
```{R echo=F, message=F, results='asis'}
# 3.b
tree::tree(log(Salary) ~ Years + Hits, Hitters)
```
## Question 4.a Custom Implementation
```{R echo=F, message=F, results='markdown'}
# 4.a
heart <- read.csv('./data/Heart.csv')
heart$Thal <- as.factor(heart$Thal)
heart$Ca <- as.factor(heart$Ca)
heart$AHD <- as.factor(heart$AHD)
heart <- na.omit(heart[,c("AHD", "Thal", "Ca")])
log <- fitTree(heart[,c("Thal", "Ca")], heart$AHD,
control=list(mincut=5, minsize=10, mindev=.01))
```
## Question 4.b R Comparison
```{R echo=F, message=F, results='asis'}
# 4.b
tree::tree(AHD ~ Thal + Ca, heart, split="deviance")
```
# Section 2
```{r show-code, ref.label=knitr::all_labels(), echo = TRUE, eval=FALSE}
```