library(tidyverse)

as.data.frame(M, stringsAsFactors = FALSE) %>%
    rownames_to_column('id') %>%
    mutate(
        id = gsub('SuperSMART_', 'S', id),
        id = gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE)
    ) %>%
    separate(id, into = c('S', 'R', 'p'), sep = '_', remove = FALSE) %>%
    mutate(., group = group_indices(., S))

##             id    S  R    p x group
## 1 S003_T1_p555 S003 T1 p555 1     1
## 2 S003_T2_p456 S003 T2 p456 2     1
## 3 S004_T3_p785 S004 T3 p785 3     2
## 4 S004_T4_p426 S004 T4 p426 4     2
## 5 S027_T1_p112 S027 T1 p112 5     3
## 6 S027_T2_p414 S027 T2 p414 6     3
## 7 S042_T3_p155 S042 T3 p155 7     4
## 8 S042_T5_p775 S042 T5 p775 8     4

## If you really want it as a function:
normalize_data <- function(m, ...) {

    as.data.frame(m, stringsAsFactors = FALSE) %>%
        tibble::rownames_to_column('id') %>%
        dplyr::mutate(
            id = gsub('SuperSMART_', 'S', id),
            id = gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE)
        ) %>%
        tidyr::separate(id, into = c('S', 'R', 'p'), sep = '_', remove = FALSE) %>%
        dplyr::mutate(., group = dplyr::group_indices(., S))

}

So this is a grouped capture, denoted by the parentheses in `'(^S)(\\d{2})(_)'`. There are 3 groups being captured: 1: `(^S)`, 2: `(\\d{2})`, 3: `(_)`. The first one says grab an `S` at the beginning of the string (`^`). The second group says grab exactly 2 digits (`\\d{2}`) right after that, and the 3rd group says those digits must be followed by an underscore.

So `S27_T2_p414` would be matched by this but `S004_T3_p785` would not.
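
To check the matching claim directly, here is a minimal sketch using base R's `grepl()` on the two example ids above. The variable names `pattern` and `ids` are only for illustration and are not part of the original solution.

```r
# Same pattern as in the gsub() call; backslashes are doubled inside an R string
pattern <- '(^S)(\\d{2})(_)'

# Example ids from the explanation above
ids <- c('S27_T2_p414', 'S004_T3_p785')

# TRUE where the id starts with S, then exactly two digits, then an underscore
matched <- grepl(pattern, ids, perl = TRUE)
matched
## [1]  TRUE FALSE
```

The second id fails because, after `S` and the two digits `00`, the next character is `4` rather than an underscore.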

For the replacement, `'\\10\\2\\3'`: if a string matches `'(^S)(\\d{2})(_)'`, we can use `perl = TRUE` to refer back to the captured groups (denoted by the parentheses above) in the replacement. The `\\1` corresponds to `(^S)`, the `\\2` corresponds to `(\\d{2})`, and `\\3` goes with `(_)`. We can insert things in between the capture groups. This technique is called a backreference. In this case I insert an extra zero between the first capture group and the second to ensure all numbers have 3 digits. This assumes that you have at most 3 digits in the string after `S`.
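
As a small sketch of the backreference replacement on its own, outside the pipeline, the snippet below runs the same `gsub()` call on a few ids. The vector `ids` and the one-digit id `S3_T1_p555` are made up for illustration; they are not from the original data.

```r
# Two ids from the explanation plus a hypothetical one-digit id
ids <- c('S27_T2_p414', 'S004_T3_p785', 'S3_T1_p555')

# \\1 = (^S), then a literal 0, then \\2 = the two digits, then \\3 = the underscore
padded <- gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', ids, perl = TRUE)
padded
## [1] "S027_T2_p414" "S004_T3_p785" "S3_T1_p555"
```

Only the two-digit id is changed. Ids that already have three digits are left alone, and so is the one-digit id, since the pattern requires exactly two digits before the underscore; a one-digit case would need its own rule.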

Hi Tyler. Thanks for this solution again. Can you please explain the `gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE)` line?
I think I understand what has been done in all other lines.
Regards