Multistep Cleaning/Regex: Substitution & Extract Portion Before Separator [on hold]
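library(tidyverse)

# M: the input matrix whose row names hold the IDs (not defined in this gist)
as.data.frame(M, stringsAsFactors = FALSE) %>%
    rownames_to_column('id') %>%
    mutate(
        # Shorten the prefix: 'SuperSMART_' becomes 'S'
        id = gsub('SuperSMART_', 'S', id),
        # Zero-pad two-digit numbers after the leading S to three digits
        id = gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE)
    ) %>%
    # Split the id on '_' into three columns, keeping the original id
    separate(id, into = c('S', 'R', 'p'), sep = '_', remove = FALSE) %>%
    # Number the distinct values of S
    mutate(., group = group_indices(., S))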
##             id    S  R    p x group
## 1 S003_T1_p555 S003 T1 p555 1     1
## 2 S003_T2_p456 S003 T2 p456 2     1
## 3 S004_T3_p785 S004 T3 p785 3     2
## 4 S004_T4_p426 S004 T4 p426 4     2
## 5 S027_T1_p112 S027 T1 p112 5     3
## 6 S027_T2_p414 S027 T2 p414 6     3
## 7 S042_T3_p155 S042 T3 p155 7     4
## 8 S042_T5_p775 S042 T5 p775 8     4
## If you really want it as a function:
# Assumes the pipe (%>%) is available, e.g. via library(tidyverse) above
normalize_data <- function(m, ...) {
    as.data.frame(m, stringsAsFactors = FALSE) %>%
        tibble::rownames_to_column('id') %>%
        dplyr::mutate(
            id = gsub('SuperSMART_', 'S', id),
            id = gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE)
        ) %>%
        tidyr::separate(id, into = c('S', 'R', 'p'), sep = '_', remove = FALSE) %>%
        dplyr::mutate(., group = dplyr::group_indices(., S))
}
@drhmoosavi

Hi Tyler. Thanks for this solution again. Can you please explain the gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE) line?
I think I understand what has been done in all other lines.
Regards

@trinker
Author

trinker commented Jun 27, 2018

So this is a grouped capture, denoted by the parentheses in '(^S)(\\d{2})(_)'. There are 3 groups being captured. 1: (^S), 2: (\\d{2}), 3: (_). The first one says grab an S at the beginning of the string (^). The second group says that exactly 2 digits (\\d{2}) must come right after it, and the 3rd group says those digits must be followed by an underscore.

So S27_T2_p414 would be matched by this but S004_T3_p785 would not.
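To check which strings the pattern matches, here is a minimal base R sketch. The `ids` vector is made up for illustration (it is not the data from the gist); the third entry is an extra, hypothetical one-digit case.

```r
# Hypothetical IDs, as they would look after 'SuperSMART_' has become 'S'
ids <- c('S27_T2_p414', 'S004_T3_p785', 'S3_T1_p100')

# Same pattern as in the gist: S at the start, exactly two digits, then '_'
pattern <- '(^S)(\\d{2})(_)'

# grepl() returns TRUE where the pattern is found
matched <- grepl(pattern, ids, perl = TRUE)
names(matched) <- ids
print(matched)
```

Only the two-digit ID matches. In 'S004_T3_p785' the character after the first two digits is another digit rather than an underscore, and in 'S3_T1_p100' there is only one digit before the underscore.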

For the replacement of '\\10\\2\\3'... If a string matches '(^S)(\\d{2})(_)', the replacement can refer back to the captured groups (denoted by the parentheses above). gsub supports these references, \\1 through \\9, with or without perl = TRUE. The \\1 corresponds to (^S); the \\2 corresponds to (\\d{2}); and \\3 goes with (_). Because only single-digit references exist, '\\10' is read as group 1 followed by a literal 0. We can insert things in between the capture groups. This technique is called backreference. In this case I insert an extra zero between the first capture group and the second to ensure all numbers have 3 digits. This makes an assumption that the number after S has either 2 or 3 digits: only the 2-digit ones are padded, and anything else is left as it is.
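A small base R sketch of the replacement step on its own may help. The two IDs are the ones mentioned above; the variable names are only for illustration.

```r
ids <- c('S27_T2_p414', 'S004_T3_p785')

# '\\1' = (^S), then a literal '0', then '\\2' = the two digits, '\\3' = '_'
padded <- gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', ids, perl = TRUE)
print(padded)

# The same call with the default regex engine, for comparison
padded_default <- gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', ids)
print(padded_default)
```

The first ID gets padded to 'S027_T2_p414', while 'S004_T3_p785' does not match the pattern and comes back unchanged. Both calls should give the same result here, since neither the pattern nor the backreferences rely on anything Perl-specific.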
