@rmflight
Last active November 11, 2018 19:16
removing duplicate entries across rows and columns
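ex_data = data.frame(A = c("A", "C", "E", "F", "G", "H", "I"),
                     B = c("B", "D", "A", "E", "I", "J", "K"),
                     C = "C",
                     stringsAsFactors = FALSE)

# only these columns are checked for repeated entries
consider_cols = c("A", "B")

# the first row is always kept, so its entries seed the "seen" set
all_entries = unlist(ex_data[1, consider_cols], use.names = FALSE)
irow = 2

# nrow(ex_data) is re-evaluated on every pass, so the loop tracks row removals
while (irow <= nrow(ex_data)) {
  message("row ", irow, " of ", nrow(ex_data))
  new_entries = unlist(ex_data[irow, consider_cols], use.names = FALSE)
  if (any(new_entries %in% all_entries)) {
    # drop the row; do not advance, the next row slides into position irow
    ex_data = ex_data[-irow, ]
  } else {
    all_entries = c(all_entries, new_entries)
    irow = irow + 1
  }
}
print(ex_data)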
@rmflight
Author

So we start by saving the first row's entries as all_entries. We unlist it because a single row of a data.frame is still a data.frame, and really we just want the entries.
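
To make the unlist step concrete, here is a small sketch using the same first row as the gist. The variable names first_row and first_entries are only for illustration and do not appear in the original code.

```r
ex_data = data.frame(A = c("A", "C", "E"),
                     B = c("B", "D", "A"),
                     C = "C",
                     stringsAsFactors = FALSE)

# subsetting one row and two columns still returns a data.frame
first_row = ex_data[1, c("A", "B")]
print(is.data.frame(first_row))

# unlist() flattens it to a plain character vector; use.names = FALSE
# drops the column names so only the entries remain
first_entries = unlist(first_row, use.names = FALSE)
print(first_entries)
```

A plain vector is what %in% needs on both sides for the membership check in the loop.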

Then, while irow is less than or equal to the total number of rows (so this will go until the end, no matter how big the data is, I think), we get the next row's entries and check whether any of them are in the entries seen so far. If yes, we get rid of that row and don't update the row counter, because the following row moves up into the same position. If no, we add the entries to all_entries and then increment the counter to go to the next row.
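
The same rule can also be sketched as a function that marks rows to keep instead of deleting them inside the loop, which avoids having to reason about the shifting row index. This is only one possible variant, not the gist's code: the name remove_cross_duplicates and the variables keep and seen are hypothetical.

```r
# Hypothetical helper: keep a row only if none of its entries in
# consider_cols has appeared in an earlier kept row.
remove_cross_duplicates = function(df, consider_cols) {
  keep = logical(nrow(df))   # all FALSE to start
  seen = character(0)        # entries from the rows kept so far
  for (irow in seq_len(nrow(df))) {
    new_entries = as.character(unlist(df[irow, consider_cols], use.names = FALSE))
    if (!any(new_entries %in% seen)) {
      keep[irow] = TRUE
      seen = c(seen, new_entries)
    }
  }
  df[keep, , drop = FALSE]
}

ex_data = data.frame(A = c("A", "C", "E", "F", "G", "H", "I"),
                     B = c("B", "D", "A", "E", "I", "J", "K"),
                     C = "C",
                     stringsAsFactors = FALSE)

result = remove_cross_duplicates(ex_data, c("A", "B"))
print(result)  # rows 3 (E, A) and 7 (I, K) are dropped
```

As in the loop above, entries from a dropped row are not added to the seen set, so row 4 (F, E) survives even though E appeared in the dropped row 3.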
