@andland
Last active December 18, 2015 22:29
A simple nearest neighbor algorithm for a dataset with categorical variables. This code was written for the Amazon Employee Access Challenge on Kaggle.com.
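# rm(list=ls())
setwd("Kaggle/Amazon Employee")
train = read.csv("train.csv")
test = read.csv("test.csv")
train$ROLE_TITLE <- NULL # Because the same as ROLE_CODE
test$ROLE_TITLE <- NULL # Because the same as ROLE_CODE

# Similarity between one observation and every row of a matrix:
# the number of columns whose values match exactly.
# Note: despite the name, this is a simple matching count rather than
# the Jaccard index proper.
jaccard <- function(vec, matrix) {
  rowSums(as.matrix(sweep(matrix, 2, as.numeric(vec), "==")))
}

# Predict each test row as the mean response of its most similar training
# rows; all rows tied for the maximum similarity count as neighbors.
# If test.y is supplied, it is returned alongside the predictions.
pred.jaccard <- function(train.x, test.x, train.y, test.y, do.trace=100) {
  nn.pred = numeric(nrow(test.x))
  for (i in seq_len(nrow(test.x))) {
    if (i %% do.trace == 0) {
      cat(i / nrow(test.x), "\n") # fraction of test rows processed
    }
    temp = jaccard(test.x[i, ], train.x)
    neighbs = (temp == max(temp))
    nn.pred[i] = mean(train.y[neighbs])
  }
  if (missing(test.y)) {
    nn.df = data.frame(Pred = nn.pred)
  } else {
    nn.df = data.frame(Pred = nn.pred, ACTION = test.y)
  }
  nn.df
}

# Drop the first column of each set (ACTION in train, id in test)
test.nn = pred.jaccard(train[, -1], test[, -1], train$ACTION)
write.csv(cbind(id = test$id,
                ACTION = test.nn$Pred),
          file = "submission-NearestNeighbor.csv",
          row.names = FALSE)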
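
To see what the similarity and the tie handling do, here is a minimal sketch on a made-up data set. The name `match.count` and all values in `train.x`, `train.y`, and `query` are hypothetical and are not part of the Kaggle data; the function body mirrors `jaccard` from the script, and the prediction step mirrors the loop body of `pred.jaccard`.

```r
# Hypothetical helper mirroring jaccard(): for each row of mat, count the
# columns whose value equals the corresponding entry of vec.
match.count <- function(vec, mat) {
  rowSums(sweep(as.matrix(mat), 2, as.numeric(vec), "=="))
}

# Made-up training data: three categorical columns coded as integers
train.x <- data.frame(RESOURCE = c(1, 1, 2, 3),
                      MGR_ID   = c(10, 10, 20, 10),
                      ROLE     = c(5, 6, 5, 5))
train.y <- c(1, 0, 1, 0)

# Made-up observation to classify
query <- c(RESOURCE = 1, MGR_ID = 10, ROLE = 7)

sims <- match.count(query, train.x)   # matches per training row: 2 2 0 1
neighbs <- sims == max(sims)          # rows 1 and 2 tie for the best match
pred <- mean(train.y[neighbs])        # mean of 1 and 0, i.e. 0.5
print(sims)
print(pred)
```

Because every training row tied for the maximum match count is treated as a neighbor, the prediction is a fraction between 0 and 1 rather than a hard class label.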