library(tidyverse)

as.data.frame(M, stringsAsFactors = FALSE) %>%
    rownames_to_column('id') %>%
    mutate(
        id = gsub('SuperSMART_', 'S', id),
        id = gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE)
    ) %>%
    separate(id, into = c('S', 'R', 'p'), sep = '_', remove = FALSE) %>%
    mutate(., group = group_indices(., S))

##             id    S  R    p x group
## 1 S003_T1_p555 S003 T1 p555 1     1
## 2 S003_T2_p456 S003 T2 p456 2     1
## 3 S004_T3_p785 S004 T3 p785 3     2
## 4 S004_T4_p426 S004 T4 p426 4     2
## 5 S027_T1_p112 S027 T1 p112 5     3
## 6 S027_T2_p414 S027 T2 p414 6     3
## 7 S042_T3_p155 S042 T3 p155 7     4
## 8 S042_T5_p775 S042 T5 p775 8     4

## If you really want it as a function:
normalize_data <- function(m, ...) {

    as.data.frame(m, stringsAsFactors = FALSE) %>%
        tibble::rownames_to_column('id') %>%
        dplyr::mutate(
            # shorten the prefix, then zero-pad two-digit subject numbers
            id = gsub('SuperSMART_', 'S', id),
            id = gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE)
        ) %>%
        tidyr::separate(id, into = c('S', 'R', 'p'), sep = '_', remove = FALSE) %>%
        # one integer index per distinct value of S
        dplyr::mutate(., group = dplyr::group_indices(., S))

}
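
The gist never shows what M looks like, so here is a small sketch with a made-up matrix to try the pipeline on. The row names and the x column are assumptions, not the original data. It also swaps in group_by() plus cur_group_id() for the group index, since, as far as I know, dplyr 1.0.0 and later discourage passing grouping columns to group_indices(); that swap is my variation, not part of the original solution.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical input: a one-column matrix whose row names carry the ids
M <- matrix(
    1:4, ncol = 1,
    dimnames = list(
        c('SuperSMART_003_T1_p555', 'SuperSMART_003_T2_p456',
          'SuperSMART_27_T1_p112', 'SuperSMART_27_T2_p414'),
        'x'
    )
)

out <- as.data.frame(M, stringsAsFactors = FALSE) %>%
    rownames_to_column('id') %>%
    mutate(
        id = gsub('SuperSMART_', 'S', id),
        id = gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE)
    ) %>%
    separate(id, into = c('S', 'R', 'p'), sep = '_', remove = FALSE) %>%
    # variation on the original: number the groups with cur_group_id()
    group_by(S) %>%
    mutate(group = cur_group_id()) %>%
    ungroup()

print(out)
```

Rows that share the same S value get the same group number, which is what the group column in the output above shows.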
So '(^S)(\\d{2})(_)' is a grouped capture; each pair of parentheses marks one group, and there are 3 groups being captured. 1: (^S), 2: (\\d{2}), 3: (_). The first one says start at the beginning of the string (^) and grab an S. The second group says exactly 2 digits (\\d{2}) must come right after that, and the 3rd group says those digits must be followed by an underscore.
So S27_T2_p414 would be matched by this but S004_T3_p785 would not, because after its first two digits comes another digit rather than an underscore.
For the replacement of '\\10\\2\\3': if the string matches '(^S)(\\d{2})(_)', gsub (called here with perl = TRUE) lets us reuse the captured groups (denoted by the parentheses above) in the replacement. The \\1 corresponds to (^S), the \\2 corresponds to (\\d{2}), and \\3 goes with (_). We can insert things in between the capture groups. This technique is called a backreference. In this case I insert an extra zero between the first capture group and the second to ensure all numbers have 3 digits. R only recognizes \\1 through \\9 in a replacement, so '\\10' is read as group 1 followed by a literal 0. This assumes the number after S has either 2 or 3 digits; a 1-digit number would not match and would be left unchanged.
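
To see the matching and padding in isolation, here is a minimal base R sketch. The three ids are made-up examples chosen to cover the 2-digit, 3-digit, and 1-digit cases; they are not from the original data.

```r
# Hypothetical ids covering the 2-digit, 3-digit and 1-digit cases
ids <- c('SuperSMART_27_T2_p414', 'SuperSMART_004_T3_p785', 'SuperSMART_3_T1_p555')

# Step 1: shorten the prefix so every id starts with S
ids <- gsub('SuperSMART_', 'S', ids)

# Step 2: which ids match the pattern (S, exactly 2 digits, underscore)?
hits <- grepl('(^S)(\\d{2})(_)', ids, perl = TRUE)

# Step 3: rebuild matching ids as group 1, a literal 0, group 2, group 3
padded <- gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', ids, perl = TRUE)

print(hits)
print(padded)
```

Only the first id matches, so only it picks up the extra zero; the 3-digit and 1-digit ids pass through untouched.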
Hi Tyler. Thanks again for this solution. Can you please explain the gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE) line?
I think I understand what has been done in all other lines.
Regards