Last active
January 24, 2017 20:50
Convert a dgCMatrix to libsvm format
#' Convert a dgCMatrix to libsvm format
#' @param sm A sparse matrix of class "dgCMatrix"
#' @param label A vector of labels, one per row of sm; defaults to all 0
#' @return A character vector with one "label index:value ..." string per row
#' @examples
#' library(Matrix)
#' regMat <- matrix(runif(16), 4, 4)
#' regMat[sample(16, 5)] <- 0
#' sparseMat <- Matrix(regMat, sparse = TRUE)
#' conv2libsvm(sparseMat)
conv2libsvm <- function(sm, label = rep(0, dim(sm)[1])) {
  stopifnot(dim(sm)[1] == length(label))
  # Transpose so that each column of tsm holds one row of sm
  tsm <- Matrix::t(sm)
  i <- tsm@i  # zero-based indices of the non-zero entries
  p <- tsm@p  # column pointers into i and x
  x <- tsm@x  # non-zero values
  vapply(seq(dim(tsm)[2]), function(c) {
    idx <- (p[c]+1):p[c+1]
    paste(label[c], paste(i[idx], x[idx], sep = ":", collapse = " "))
  }, FUN.VALUE = character(1))
}
Thanks for this 👍 I think the index is off by 1 though. Here is a suggested fix:
data(agaricus.train, package='xgboost')
conv2libsvm <- function(sm, label = rep(0, dim(sm)[1])) {
stopifnot(dim(sm)[1] == length(label))
tsm <- Matrix::t(sm)
i <- tsm@i
p <- tsm@p
x <- tsm@x
vapply(seq(dim(tsm)[2]), function(c) {
idx <- (p[c]+1):p[c+1]
paste(label[c], paste(i[idx]+1, x[idx], sep = ":", collapse = " ")) #Note +1 here
}, FUN.VALUE = character(1))
}
conv2libsvm(agaricus.train$data,agaricus.train$label)
This gives the following first line of output:
1 3:1 10:1 11:1 21:1 30:1 34:1 36:1 40:1 41:1 53:1 58:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 105:1 117:1 124:1
which matches the first line of the file xgboost-master/demo/binary_classification/agaricus.txt.train
Without the +1, the first 3 lines are:
[1] "1 2:1 9:1 10:1 20:1 29:1 33:1 35:1 39:1 40:1 52:1 57:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 104:1 116:1 123:1"
[2] "0 2:1 9:1 19:1 20:1 22:1 33:1 35:1 38:1 40:1 52:1 55:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 105:1 115:1 119:1"
[3] "0 0:1 9:1 18:1 20:1 23:1 33:1 35:1 38:1 41:1 52:1 55:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 105:1 115:1 121:1"
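Note that the 3rd row contains index 0, so the output is zero-based. This is because the i slot of a dgCMatrix stores zero-based indices, which is inconsistent with R's 1-based indexing and with the reference file.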
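One more edge case seems worth sketching: when a row has no non-zero entries, (p[c]+1):p[c+1] counts downwards and picks up entries from neighbouring rows. The variant below is only a sketch under that assumption; the name conv2libsvm_safe is hypothetical and not part of the gist. It uses seq_len() so that an empty row yields just the label.

```r
library(Matrix)

# Hypothetical variant of conv2libsvm (the name is illustrative, not from the gist).
# It keeps the +1 index fix and additionally handles rows without non-zero entries.
conv2libsvm_safe <- function(sm, label = rep(0, nrow(sm))) {
  stopifnot(nrow(sm) == length(label))
  # Transpose so that each column of tsm holds one row of sm.
  tsm <- Matrix::t(sm)
  i <- tsm@i
  p <- tsm@p
  x <- tsm@x
  vapply(seq_len(ncol(tsm)), function(col) {
    n <- p[col + 1] - p[col]           # number of non-zero entries in this row
    if (n == 0) return(as.character(label[col]))
    idx <- p[col] + seq_len(n)         # 1-based positions into i and x
    paste(label[col], paste(i[idx] + 1, x[idx], sep = ":", collapse = " "))
  }, FUN.VALUE = character(1))
}

# Small example: row 2 is entirely zero.
m <- sparseMatrix(i = c(1, 3, 3), j = c(2, 1, 3), x = c(5, 7, 9), dims = c(3, 3))
res <- conv2libsvm_safe(m)
print(res)
```

With the original index expression, the empty second row would instead borrow entries from the rows around it.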
You can use this to create a labeled libsvm file by setting label to a data frame column instead of rep(0, dim(sm)[1]).
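As a rough illustration of that suggestion, the sketch below passes a data frame column as label and writes the result to a file. The data frame df, its column names, and the temporary output path are made up for the example; conv2libsvm is repeated here with the +1 fix from the earlier comment so the snippet runs on its own, and it assumes every row has at least one non-zero entry.

```r
library(Matrix)

# conv2libsvm with the +1 index fix suggested in the earlier comment
conv2libsvm <- function(sm, label = rep(0, dim(sm)[1])) {
  stopifnot(dim(sm)[1] == length(label))
  tsm <- Matrix::t(sm)
  i <- tsm@i
  p <- tsm@p
  x <- tsm@x
  vapply(seq(dim(tsm)[2]), function(c) {
    idx <- (p[c]+1):p[c+1]
    paste(label[c], paste(i[idx]+1, x[idx], sep = ":", collapse = " "))
  }, FUN.VALUE = character(1))
}

# Hypothetical data: y is the label column, a and b are the features.
df <- data.frame(y = c(1, 0, 1), a = c(0, 2, 1), b = c(3, 0, 4))
sm <- Matrix(as.matrix(df[, c("a", "b")]), sparse = TRUE)

# Use the data frame column as the label instead of the default zeros.
out <- conv2libsvm(sm, label = df$y)

# Write one libsvm line per row to a temporary file (path chosen for illustration).
path <- tempfile(fileext = ".libsvm")
writeLines(out, path)
print(readLines(path))
```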