Skip to content

Instantly share code, notes, and snippets.

@cth
Last active May 12, 2016 10:36
Show Gist options
  • Save cth/21eefab23866d4d1b99d to your computer and use it in GitHub Desktop.
Save cth/21eefab23866d4d1b99d to your computer and use it in GitHub Desktop.
strandflip.R version 0.2
By Christian Theil Have
with inspiration from SAS script by Aldi Kraja
DISCLAIMER: NO WARRANTIES WHATSOEVER ARE PROVIDED WITH THIS
PROGRAM. YOU RUN IT AT YOUR OWN RISK.
YOU CAN DISTRIBUTE this utility pgm in other collaborations
purpose: flip dosage if the alleles in original data are
different when compared to 1000G set of nucmtDNA
fix also the info file for the alleles
Running:
========
The program is run using Rscript, e.g.,
Rscript strandflip.R study=mystudy stddosage=mystudy.dosage stdinfo=mystudy.info dirout=/output/here ...
This will run the program reading `mystudy.dosage` and `mystudy.info` from the current
directory. It also expects to find the .csv files with reference data thay is
provided by Aldi here: https://dsgweb.wustl.edu/pleionet/chargemtdna/db/db.html
Upon completion, the script produces three files per chromosome analyzed in `dirout`:
`mystudy.chr??.info`: An updated INFO file (flipped appropriately to match reference)
`mystudy.chr??.dosage`: Updated dosage file (flipped appropriately to match reference)
`mystudy.chr??.mismatch`: File contained list of mismatched variants (discarded from further analysis)
Options:
========
- startc= and endc= are chrom numbers of start and end ex.
one can select startc=1, and endc=22, or one may select
startc=22 and endc=22, etc. thus one may run one chrom
at a time in different servers if it is possible
- study= is your study label
- stddosage= is the study dosage tables for your study
rows are subjects and columns are markers in your study,
with one column as subject ID, content of a cell is dosage
- aref= is the reference allele of 1000G, as letter
- aalt= is the alternative allele in 1000G, as letter
- aaltaf= is allele frequency for ALT allele, in decimal
- bcoded= is the coded allele in your study, as letter
- bnoncoded= is the non-coded allele in your study, as letter
- bcodedaf= is the allele freq in your study for coded al.
- ambigSNPswitch=FALSE, if TRUE the module of flipping ambigous
SNPs becomes active. One turns it on if you see in the
plot produced from this pgm of study allele freq against
1000G allele freq in a direction against diagonal
- allowindelabbrev=FALSE, if TRUE then I/D alleles in your
INFO file are matched against long/short alleles in the
reference data and retained/flipped if one reference allele
is longer than the other. Only, if they are the same length
(e.g., a SNP) then indel variant will be discarded and logged
to a mismatched file.
- delta=, a value of the difference between allele freq of
your study and 1000G that indicates that alleles have to
be flipped. For example, if the allele freq diff is >= 0.4
this says that your study may have A=0.3 while 1000G A=0.7
or your study A=0.2 and 1000G A=0.8 etc.
Ambiguous SNPs are considered A and T , C and G where we
cannot distinguish which strand these alleles are coming
from.
- dirout= is a NEW directory to be CREATED
any time that you run this program.
Changelog
=========
Version 0.2 (Jan 4 2016)
========================
Some issues were found in the first version of the strandflip.R program:
> In the SAS program, things that do not match (after we have checked all other possibilities such
> as reverse strand, complement etc) based on alleles they go to a file named "missmatched", and
> they are not included in the final version of the data to be used for the associations.
> > > - The same has to happen with the R program.
This now also happens in the R program:
1) A file "flipped100g/[study].[chromosome].mismatched" file is created.
2) The info/dosage files only contains non-mismatched variants
> - Christian has used long labels concatenated with underscores that in different versions of R will
> be misinterpreted: For example here is a partial strandflip.R program:
> The underscore in R is also assignment. Please use a different way of distinguishing the
> combined words as follow: for "ref_and_info" you can write "refAndInfo"; for the "info <-
> fields" you can translate it to "infoFields" etc.
I think this is only a emacs related problem. All variables are now named with "." as word separator.
> - The read.csv or read.table in strandflip.R have to be used carefully dependent on a study’s data
> format (TAB or CSV delimit). Christian, please provide a readme that emphasizes this idea.
> - Christian, in your strandflip.R, in the hg1000nucmtdnab138_c#.csv instead of using ID
> variable, please refer to rs138, pos138, chrom138, REF, ALT, altaf, because within the same file
> referring to ID variable (which is the same as rs138) may confuse analysts who normally think
> for ID as subjectID.
The script uses the rs138 id to identify variants from the reference files.
It also assumes that the id-column of the INFO file is called rs138 (rather than ID).
> Indel alleles may be given in single-character format, e.g., I/D for Insertion/Deletion, whereas
> the reference gives the full nucleotide sequence. Obviously, I/D alleles cannot be complemented
> and leads to errors from the program.
The program now produces only a warning when an allele cannot be complemented. These I/D
variants will be excluded in output INFO+dosage files as well as logged to the xxx.mismatched file.
However, when special option "allowindelabbrev" the program will allow and possible flip these I/D,
by comparing with reference allele length.
> Other improvements:
- Bug fix: misspelled the ambigSwitch argument, so it effectively was not used.
- Bug fix: Complementing complained because alleles were loaded as factors. Fixed by explicitly converting to character.
- Slightly better plots with different kinds of points to delineate the action taken by the program
for each individual variant.
- Reference database files from Aldi does not need to be unpacked (but can be).
Version 0.1 (20. Nov. 2015)
===========================
First version of the script.
# strandflip.R
# By Christian Theil Have
# with inspiration from SAS script by Aldi Kraja
#
# See README file for details on running the program.
#
# DISCLAIMER: NO WARRANTIES WHATSOEVER ARE PROVIDED WITH THIS
# PROGRAM. YOU RUN IT AT YOUR OWN RISK: 11.10.2015
# YOU CAN DISTRIBUTE this utility pgm in other collaborations
nucleotide.complement <- function(x) list("A"="T","T"="A","G"="C","C"="G")[[x]]
compl <- function(seq) {
try.compl <- do.call(paste0,lapply(strsplit(as.character(seq),"")[[1]],nucleotide.complement))
if (length(try.compl) == 1) {
try.compl
} else {
warning(paste("Sequence ", seq, " cannot be complemented"))
}
}
alleles.are.identical <- function(ref.a0, ref.a1, a0, a1) ref.a0 == a0 && ref.a1 == a1
alleles.are.flipped <- function(ref.a0, ref.a1, a0, a1) ref.a0 == a1 && ref.a1 == a0
indel.is.identical <- function(ref.a0, ref.a1, a0, a1)
((a0 == "I" && a1 == "D" && nchar(ref.a0) > nchar(ref.a1)) || (a0 == "D" && a1 == "I" && nchar(ref.a0) < nchar(ref.a1)))
indel.is.flipped <- function(ref.a0, ref.a1, a0, a1)
(a0 == "I" && a1 == "D" && nchar(ref.a0) < nchar(ref.a1)) || (a0 == "D" && a1 == "I" && (nchar(ref.a0) > nchar(ref.a1)))
alleles.are.compliment <- function(ref.a0, ref.a1, a0, a1) { print(paste(ref.a0,ref.a1,a0,a1,sep=":")); compl(ref.a0) == compl(a0) && ref.a1 == compl(a1) }
alleles.are.compliment.and.flipped <- function(ref.a0, ref.a1, a0, a1) compl(ref.a0) == a1 && compl(ref.a1) == a0
alleles.ambigous <- function(a0,a1) compl(a0)== a1
# Tries to read file as tab-delimited, if this fails -- fall back to comma separated
read.delim.guess <- function(...) {
data<-read.table(...)
if (ncol(data)==1)
read.csv(...)
else
data
}
flip <- function(
startc=1,
endc=22,
study="test",
dirin=getwd(),
stddosage="test.dosage",
stdinfo="test.info",
aid="rs138",
aref="REF",
alt="ALT",
aaltaf="altaf",
bid="rs138",
bcoded="effectal",
bnoncoded="noneffectal",
bcodedaf="codedaf",
pos="pos",
ambigSNPswitch=FALSE,
allowindelabbrev=FALSE,
delta=0.4,
dirout=paste0(getwd(), "/flipped1000g/")) {
if (!file.exists(dirout))
system(paste("mkdir -p", dirout))
for(c in startc:endc) {
# load 1000G file (only column ID, REF, ALT, altaf)
ref.file = paste0(dirin, "/hg1000nucmtdnab138_c", c, ".csv")
if (file.exists(paste0(ref.file, ".gz"))) # File unpacked or not?
ref.file <- gzfile(paste0(ref.file, ".gz"))
ref.1KG <- read.csv(ref.file,h=T,stringsAsFactors=T)[,c(aid,aref,alt,aaltaf)]
ref.1KG <- ref.1KG
# read info file
info <- read.delim.guess(stdinfo,h=T,stringsAsFactors=F)
info.fields <- c(bid,bcoded,bnoncoded,bcodedaf)
stopifnot(all(info.fields %in% colnames(info)))
info <- info[,info.fields]
info[[aid]] <- info[[bid]] # for merging purposes
# Assumes no name-overlap between bcoded,bnoncoded,bcodedaf and aref,alt,aaltaf
ref.and.info <- merge(info, ref.1KG, by=aid)
ref.and.info$flip <- rep(FALSE, nrow(ref.and.info))
ref.and.info$origin <- rep(NA, nrow(ref.and.info))
# read dosages file
dosages <- read.delim.guess(stddosage,h=T)
for(snp in 1:nrow(ref.and.info)) {
a0 <- as.character(ref.and.info[snp,][[bnoncoded]])
a1 <- as.character(ref.and.info[snp,][[bcoded]])
ref.a0 <- as.character(ref.and.info[snp ,][[aref]])
ref.a1 <- as.character(ref.and.info[snp ,][[alt]])
frequency.mismatch = (abs(ref.and.info[[bcodedaf]]-ref.and.info[[aaltaf]]) > delta)
if (alleles.are.identical(ref.a0,ref.a1,a0,a1) || (allowindelabbrev && indel.is.identical(ref.a0, ref.a1, a0, a1))) {
ref.and.info[snp,]$origin <- "matched"
} else if ((allowindelabbrev && indel.is.flipped(ref.a0, ref.a1, a0, a1)) || alleles.are.flipped(ref.a0,ref.a1,a0,a1)) {
ref.and.info[snp,]$flip <- TRUE
ref.and.info[snp,]$origin <- "flipped"
dosages[, ref.and.info[snp,][[aid]] ] <- 2 - dosages[, ref.and.info[snp,][[aid]] ]
} else if (alleles.are.compliment(ref.a0,ref.a1,a0,a1)) {
ref.and.info[snp,]$origin <- "complement"
} else {
ref.and.info[snp,]$origin <- "mismatched"
}
if (ambigSNPswitch && alleles.are.ambigous(ref.a0,ref.a1,a0,a1) && frequency.mismatch) {
ref.and.info[snp,]$origin <- "ambigFlip"
dosages[, ref.and.info[snp,][[aid]] ] <- 2 - dosages[, ref.and.info[snp,][[aid]] ]
}
}
origin.types <- c("matched", "flipped", "complement", "mismatched", "ambigFlip")
ref.and.info$origin <- factor(ref.and.info$origin, levels=origin.types)
new.codedaf <- ifelse(ref.and.info$flip, 1-ref.and.info[[bcodedaf]],ref.and.info[[bcodedaf]])
print(summary(ref.and.info))
par(mfrow=c(1,2))
pdf(paste0(dirout, "/", study, ".chr" , c, ".frequencies.pdf"))
plot(ref.and.info[[bcodedaf]],ref.and.info[[aaltaf]], main=paste(study, "chr", c, "before flip"), xlab="Study allele frequency", ylab="Reference allele frequency", pch=as.numeric(factor(ref.and.info$origin, levels=origin.types, labels=1:5)))
plot(new.codedaf,ref.and.info[[aaltaf]],main=paste(study, "chr", c, "after flip"), xlab="Study allele frequency", ylab="Reference allele frequency",
pch=as.numeric(factor(ref.and.info$origin, levels=origin.types, labels=1:5)))
dev.off()
ref.and.info[[bcodedaf]] <- new.codedaf
good.info <- ref.and.info[!(ref.and.info$origin=="mismatched"),]
good.dosages <- dosages[,which(colnames(dosages) %in% good.info[[aid]])]
print(paste("retained", nrow(good.info), "out of", nrow(ref.and.info), "variants"))
# Make updated/extended info file
write.table(good.info,paste0(dirout,"/",study,".chr", c, ".info"),row.names=F, col.names=T, quot=F, sep="\t")
# Make updated dosage file
write.table(good.dosages,paste0(dirout,"/",study,".chr", c, ".dosage"),row.names=T, col.names=T, quot=F,sep="\t")
# Write mismatched table
if (sum(ref.and.info$origin == "mismatched") > 0) {
write.table(ref.and.info[ref.and.info$origin == "mismatched",]$ID, paste0(dirout, "/", study, ".chr" , c, ".mismatched"),row.names=F,col.names=F, quot=F)
}
}
}
####################### Arguments #############################
args <- commandArgs(trailingOnly=TRUE)
opts <- sapply(strsplit(args,"="), function(x) { x[2] } )
names(opts) <- sapply(strsplit(args,"="), function(x) { x[1] } )
opts.list <- as.list(opts)
for(opt in c("startc", "endc", "delta"))
if(!is.null(opts.list[[opt]]))
opts.list[[opt]] <- as.numeric(opts.list[[opt]])
for(opt in c("ambigSNPswitch", "allowindelabbrev"))
if(!is.null(opts.list[[opt]]))
opts.list[[opt]] <- as.logical(opts.list[[opt]])
print(str(opts.list))
flip.res <- do.call(flip,opts.list)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment