Last active
May 28, 2017 11:57
-
-
Save ettorerizza/04e3d2ae68e2a682004cbedb8666f987 to your computer and use it in GitHub Desktop.
Given a character vector as input, get the key collision fingerprint for each element. Forked from refinr package.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#' Get key collision fingerprints | |
#' | |
#' Given a character vector as input, get the key collision fingerprint for | |
#' each element. | |
#' | |
#' Operations in order : | |
#' | |
#'-remove leading and trailing whitespace | |
#'-change all characters to their lowercase representation | |
#'-remove all punctuation and control characters | |
#'-split the string into whitespace-separated tokens | |
#'-sort the tokens and remove duplicates | |
#'-join the tokens back together | |
#'-normalize extended western characters to their ASCII representation (for example "gödel" → "godel") | |
#' | |
#'For more details on key collision, see | |
#' \url{https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth}. | |
#' | |
#' @param vect Character vector. | |
#' | |
#' | |
#' @return Key collision value of the input string. | |
#' @importFrom dplyr "%>%" | |
#' | |
#' @examples \dontrun{ | |
#' get_fingerprint("Tom's Sports Equipment, Inc.") | |
#' [1] "equipment inc sports toms" | |
#' } | |
get_fingerprint <- function(vect) { | |
stopifnot(is.character(vect)) | |
vect %>% | |
tolower %>% | |
{gsub("[[:punct:]]", "", .)} %>% | |
trimws %>% | |
{gsub("\\s{2,}", " ", .)} %>% | |
strsplit(., " ", fixed = TRUE) %>% | |
lapply(., sort) %>% | |
lapply(., unique) %>% | |
vapply(., function(x) paste(x, collapse = " "), character(1)) %>% | |
iconv(., to = "ASCII//TRANSLIT") %>% | |
sapply(., function(x) if (x == "" || is.na(x)) {NA} else {x}, | |
USE.NAMES = FALSE) | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment