@Libardo1
Forked from nonsleepr/tokenizer.R
Last active August 29, 2015 14:17
ngrams.tokenizer <- function(x, n = 2) {
  # Strip leading/trailing whitespace, then split on runs of whitespace.
  trim <- function(x) gsub("(^\\s+|\\s+$)", "", x)
  terms <- strsplit(trim(x), split = "\\s+")[[1]]
  # Start from an empty character vector (plain vector() would be logical(0)).
  ngrams <- character(0)
  # Slide a window of n terms across the sentence; if there are fewer
  # than n terms, no n-grams are produced and the empty vector is returned.
  if (length(terms) >= n) {
    for (i in n:length(terms)) {
      ngram <- paste(terms[(i - n + 1):i], collapse = " ")
      ngrams <- c(ngrams, ngram)
    }
  }
  ngrams
}
ngrams.tokenizer(" this is a sentence to be ngrammized", 3)
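For comparison, the same sliding-window idea can be written as a self-contained, loop-free variant using `vapply` and base R's `trimws` (R >= 3.2.0). This is an illustrative sketch, not part of the original gist; the function name `ngrams.vectorized` is made up here.

```r
# Vectorized variant of the same n-gram tokenizer (illustrative sketch).
ngrams.vectorized <- function(x, n = 2) {
  terms <- strsplit(trimws(x), "\\s+")[[1]]
  # Fewer than n terms means no complete window, so return an empty vector.
  if (length(terms) < n) return(character(0))
  # For each window end position i, paste the n terms ending at i.
  vapply(n:length(terms),
         function(i) paste(terms[(i - n + 1):i], collapse = " "),
         character(1))
}

ngrams.vectorized("the quick brown fox jumps", 3)
# [1] "the quick brown" "quick brown fox" "brown fox jumps"
```

Pre-allocating the result via `vapply` avoids growing the vector with `c()` inside a loop, which is quadratic in the number of n-grams.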