Last active
May 12, 2016 10:36
-
-
Save cth/21eefab23866d4d1b99d to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
strandflip.R version 0.2 | |
By Christian Theil Have | |
with inspiration from SAS script by Aldi Kraja | |
DISCLAIMER: NO WARRANTIES WHATSOEVER ARE PROVIDED WITH THIS | |
PROGRAM. YOU RUN IT AT YOUR OWN RISK. | |
YOU CAN DISTRIBUTE this utility pgm in other collaborations | |
purpose: flip dosage if the alleles in original data are | |
different when compared to 1000G set of nucmtDNA | |
fix also the info file for the alleles | |
Running: | |
======== | |
The program is run using Rscript, e.g., | |
Rscript strandflip.R study=mystudy stddosage=mystudy.dosage stdinfo=mystudy.info dirout=/output/here ... | |
This will run the program reading `mystudy.dosage` and `mystudy.info` from the current | |
directory. It also expects to find the .csv files with reference data thay is | |
provided by Aldi here: https://dsgweb.wustl.edu/pleionet/chargemtdna/db/db.html | |
Upon completion, the script produces three files per chromosome analyzed in `dirout`: | |
`mystudy.chr??.info`: An updated INFO file (flipped appropriately to match reference) | |
`mystudy.chr??.dosage`: Updated dosage file (flipped appropriately to match reference) | |
`mystudy.chr??.mismatch`: File contained list of mismatched variants (discarded from further analysis) | |
Options: | |
======== | |
- startc= and endc= are chrom numbers of start and end ex. | |
one can select startc=1, and endc=22, or one may select | |
startc=22 and endc=22, etc. thus one may run one chrom | |
at a time in different servers if it is possible | |
- study= is your study label | |
- stddosage= is the study dosage tables for your study | |
rows are subjects and columns are markers in your study, | |
with one column as subject ID, content of a cell is dosage | |
- aref= is the reference allele of 1000G, as letter | |
- aalt= is the alternative allele in 1000G, as letter | |
- aaltaf= is allele frequency for ALT allele, in decimal | |
- bcoded= is the coded allele in your study, as letter | |
- bnoncoded= is the non-coded allele in your study, as letter | |
- bcodedaf= is the allele freq in your study for coded al. | |
- ambigSNPswitch=FALSE, if TRUE the module of flipping ambigous | |
SNPs becomes active. One turns it on if you see in the | |
plot produced from this pgm of study allele freq against | |
1000G allele freq in a direction against diagonal | |
- allowindelabbrev=FALSE, if TRUE then I/D alleles in your | |
INFO file are matched against long/short alleles in the | |
reference data and retained/flipped if one reference allele | |
is longer than the other. Only, if they are the same length | |
(e.g., a SNP) then indel variant will be discarded and logged | |
to a mismatched file. | |
- delta=, a value of the difference between allele freq of | |
your study and 1000G that indicates that alleles have to | |
be flipped. For example, if the allele freq diff is >= 0.4 | |
this says that your study may have A=0.3 while 1000G A=0.7 | |
or your study A=0.2 and 1000G A=0.8 etc. | |
Ambiguous SNPs are considered A and T , C and G where we | |
cannot distinguish which strand these alleles are coming | |
from. | |
- dirout= is a NEW directory to be CREATED | |
any time that you run this program. | |
Changelog | |
========= | |
Version 0.2 (Jan 4 2016) | |
======================== | |
Some issues were found in the first version of the strandflip.R program: | |
> In the SAS program, things that do not match (after we have checked all other possibilities such | |
> as reverse strand, complement etc) based on alleles they go to a file named "missmatched", and | |
> they are not included in the final version of the data to be used for the associations. | |
> > > - The same has to happen with the R program. | |
This now also happens in the R program: | |
1) A file "flipped100g/[study].[chromosome].mismatched" file is created. | |
2) The info/dosage files only contains non-mismatched variants | |
> - Christian has used long labels concatenated with underscores that in different versions of R will | |
> be misinterpreted: For example here is a partial strandflip.R program: | |
> The underscore in R is also assignment. Please use a different way of distinguishing the | |
> combined words as follow: for "ref_and_info" you can write "refAndInfo"; for the "info <- | |
> fields" you can translate it to "infoFields" etc. | |
I think this is only a emacs related problem. All variables are now named with "." as word separator. | |
> - The read.csv or read.table in strandflip.R have to be used carefully dependent on a study’s data | |
> format (TAB or CSV delimit). Christian, please provide a readme that emphasizes this idea. | |
> - Christian, in your strandflip.R, in the hg1000nucmtdnab138_c#.csv instead of using ID | |
> variable, please refer to rs138, pos138, chrom138, REF, ALT, altaf, because within the same file | |
> referring to ID variable (which is the same as rs138) may confuse analysts who normally think | |
> for ID as subjectID. | |
The script uses the rs138 id to identify variants from the reference files. | |
It also assumes that the id-column of the INFO file is called rs138 (rather than ID). | |
> Indel alleles may be given in single-character format, e.g., I/D for Insertion/Deletion, whereas | |
> the reference gives the full nucleotide sequence. Obviously, I/D alleles cannot be complemented | |
> and leads to errors from the program. | |
The program now produces only a warning when an allele cannot be complemented. These I/D | |
variants will be excluded in output INFO+dosage files as well as logged to the xxx.mismatched file. | |
However, when special option "allowindelabbrev" the program will allow and possible flip these I/D, | |
by comparing with reference allele length. | |
> Other improvements: | |
- Bug fix: misspelled the ambigSwitch argument, so it effectively was not used. | |
- Bug fix: Complementing complained because alleles were loaded as factors. Fixed by explicitly converting to character. | |
- Slightly better plots with different kinds of points to delineate the action taken by the program | |
for each individual variant. | |
- Reference database files from Aldi does not need to be unpacked (but can be). | |
Version 0.1 (20. Nov. 2015) | |
=========================== | |
First version of the script. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# strandflip.R | |
# By Christian Theil Have | |
# with inspiration from SAS script by Aldi Kraja | |
# | |
# See README file for details on running the program. | |
# | |
# DISCLAIMER: NO WARRANTIES WHATSOEVER ARE PROVIDED WITH THIS | |
# PROGRAM. YOU RUN IT AT YOUR OWN RISK: 11.10.2015 | |
# YOU CAN DISTRIBUTE this utility pgm in other collaborations | |
nucleotide.complement <- function(x) list("A"="T","T"="A","G"="C","C"="G")[[x]] | |
compl <- function(seq) { | |
try.compl <- do.call(paste0,lapply(strsplit(as.character(seq),"")[[1]],nucleotide.complement)) | |
if (length(try.compl) == 1) { | |
try.compl | |
} else { | |
warning(paste("Sequence ", seq, " cannot be complemented")) | |
} | |
} | |
alleles.are.identical <- function(ref.a0, ref.a1, a0, a1) ref.a0 == a0 && ref.a1 == a1 | |
alleles.are.flipped <- function(ref.a0, ref.a1, a0, a1) ref.a0 == a1 && ref.a1 == a0 | |
indel.is.identical <- function(ref.a0, ref.a1, a0, a1) | |
((a0 == "I" && a1 == "D" && nchar(ref.a0) > nchar(ref.a1)) || (a0 == "D" && a1 == "I" && nchar(ref.a0) < nchar(ref.a1))) | |
indel.is.flipped <- function(ref.a0, ref.a1, a0, a1) | |
(a0 == "I" && a1 == "D" && nchar(ref.a0) < nchar(ref.a1)) || (a0 == "D" && a1 == "I" && (nchar(ref.a0) > nchar(ref.a1))) | |
alleles.are.compliment <- function(ref.a0, ref.a1, a0, a1) { print(paste(ref.a0,ref.a1,a0,a1,sep=":")); compl(ref.a0) == compl(a0) && ref.a1 == compl(a1) } | |
alleles.are.compliment.and.flipped <- function(ref.a0, ref.a1, a0, a1) compl(ref.a0) == a1 && compl(ref.a1) == a0 | |
alleles.ambigous <- function(a0,a1) compl(a0)== a1 | |
# Tries to read file as tab-delimited, if this fails -- fall back to comma separated | |
read.delim.guess <- function(...) { | |
data<-read.table(...) | |
if (ncol(data)==1) | |
read.csv(...) | |
else | |
data | |
} | |
flip <- function( | |
startc=1, | |
endc=22, | |
study="test", | |
dirin=getwd(), | |
stddosage="test.dosage", | |
stdinfo="test.info", | |
aid="rs138", | |
aref="REF", | |
alt="ALT", | |
aaltaf="altaf", | |
bid="rs138", | |
bcoded="effectal", | |
bnoncoded="noneffectal", | |
bcodedaf="codedaf", | |
pos="pos", | |
ambigSNPswitch=FALSE, | |
allowindelabbrev=FALSE, | |
delta=0.4, | |
dirout=paste0(getwd(), "/flipped1000g/")) { | |
if (!file.exists(dirout)) | |
system(paste("mkdir -p", dirout)) | |
for(c in startc:endc) { | |
# load 1000G file (only column ID, REF, ALT, altaf) | |
ref.file = paste0(dirin, "/hg1000nucmtdnab138_c", c, ".csv") | |
if (file.exists(paste0(ref.file, ".gz"))) # File unpacked or not? | |
ref.file <- gzfile(paste0(ref.file, ".gz")) | |
ref.1KG <- read.csv(ref.file,h=T,stringsAsFactors=T)[,c(aid,aref,alt,aaltaf)] | |
ref.1KG <- ref.1KG | |
# read info file | |
info <- read.delim.guess(stdinfo,h=T,stringsAsFactors=F) | |
info.fields <- c(bid,bcoded,bnoncoded,bcodedaf) | |
stopifnot(all(info.fields %in% colnames(info))) | |
info <- info[,info.fields] | |
info[[aid]] <- info[[bid]] # for merging purposes | |
# Assumes no name-overlap between bcoded,bnoncoded,bcodedaf and aref,alt,aaltaf | |
ref.and.info <- merge(info, ref.1KG, by=aid) | |
ref.and.info$flip <- rep(FALSE, nrow(ref.and.info)) | |
ref.and.info$origin <- rep(NA, nrow(ref.and.info)) | |
# read dosages file | |
dosages <- read.delim.guess(stddosage,h=T) | |
for(snp in 1:nrow(ref.and.info)) { | |
a0 <- as.character(ref.and.info[snp,][[bnoncoded]]) | |
a1 <- as.character(ref.and.info[snp,][[bcoded]]) | |
ref.a0 <- as.character(ref.and.info[snp ,][[aref]]) | |
ref.a1 <- as.character(ref.and.info[snp ,][[alt]]) | |
frequency.mismatch = (abs(ref.and.info[[bcodedaf]]-ref.and.info[[aaltaf]]) > delta) | |
if (alleles.are.identical(ref.a0,ref.a1,a0,a1) || (allowindelabbrev && indel.is.identical(ref.a0, ref.a1, a0, a1))) { | |
ref.and.info[snp,]$origin <- "matched" | |
} else if ((allowindelabbrev && indel.is.flipped(ref.a0, ref.a1, a0, a1)) || alleles.are.flipped(ref.a0,ref.a1,a0,a1)) { | |
ref.and.info[snp,]$flip <- TRUE | |
ref.and.info[snp,]$origin <- "flipped" | |
dosages[, ref.and.info[snp,][[aid]] ] <- 2 - dosages[, ref.and.info[snp,][[aid]] ] | |
} else if (alleles.are.compliment(ref.a0,ref.a1,a0,a1)) { | |
ref.and.info[snp,]$origin <- "complement" | |
} else { | |
ref.and.info[snp,]$origin <- "mismatched" | |
} | |
if (ambigSNPswitch && alleles.are.ambigous(ref.a0,ref.a1,a0,a1) && frequency.mismatch) { | |
ref.and.info[snp,]$origin <- "ambigFlip" | |
dosages[, ref.and.info[snp,][[aid]] ] <- 2 - dosages[, ref.and.info[snp,][[aid]] ] | |
} | |
} | |
origin.types <- c("matched", "flipped", "complement", "mismatched", "ambigFlip") | |
ref.and.info$origin <- factor(ref.and.info$origin, levels=origin.types) | |
new.codedaf <- ifelse(ref.and.info$flip, 1-ref.and.info[[bcodedaf]],ref.and.info[[bcodedaf]]) | |
print(summary(ref.and.info)) | |
par(mfrow=c(1,2)) | |
pdf(paste0(dirout, "/", study, ".chr" , c, ".frequencies.pdf")) | |
plot(ref.and.info[[bcodedaf]],ref.and.info[[aaltaf]], main=paste(study, "chr", c, "before flip"), xlab="Study allele frequency", ylab="Reference allele frequency", pch=as.numeric(factor(ref.and.info$origin, levels=origin.types, labels=1:5))) | |
plot(new.codedaf,ref.and.info[[aaltaf]],main=paste(study, "chr", c, "after flip"), xlab="Study allele frequency", ylab="Reference allele frequency", | |
pch=as.numeric(factor(ref.and.info$origin, levels=origin.types, labels=1:5))) | |
dev.off() | |
ref.and.info[[bcodedaf]] <- new.codedaf | |
good.info <- ref.and.info[!(ref.and.info$origin=="mismatched"),] | |
good.dosages <- dosages[,which(colnames(dosages) %in% good.info[[aid]])] | |
print(paste("retained", nrow(good.info), "out of", nrow(ref.and.info), "variants")) | |
# Make updated/extended info file | |
write.table(good.info,paste0(dirout,"/",study,".chr", c, ".info"),row.names=F, col.names=T, quot=F, sep="\t") | |
# Make updated dosage file | |
write.table(good.dosages,paste0(dirout,"/",study,".chr", c, ".dosage"),row.names=T, col.names=T, quot=F,sep="\t") | |
# Write mismatched table | |
if (sum(ref.and.info$origin == "mismatched") > 0) { | |
write.table(ref.and.info[ref.and.info$origin == "mismatched",]$ID, paste0(dirout, "/", study, ".chr" , c, ".mismatched"),row.names=F,col.names=F, quot=F) | |
} | |
} | |
} | |
####################### Arguments ############################# | |
args <- commandArgs(trailingOnly=TRUE) | |
opts <- sapply(strsplit(args,"="), function(x) { x[2] } ) | |
names(opts) <- sapply(strsplit(args,"="), function(x) { x[1] } ) | |
opts.list <- as.list(opts) | |
for(opt in c("startc", "endc", "delta")) | |
if(!is.null(opts.list[[opt]])) | |
opts.list[[opt]] <- as.numeric(opts.list[[opt]]) | |
for(opt in c("ambigSNPswitch", "allowindelabbrev")) | |
if(!is.null(opts.list[[opt]])) | |
opts.list[[opt]] <- as.logical(opts.list[[opt]]) | |
print(str(opts.list)) | |
flip.res <- do.call(flip,opts.list) |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment