cth · May 12, 2016 10:36
diff --git a/README b/README
 strandflip.R version 0.2
 By Christian Theil Have
 with inspiration from SAS script by Aldi Kraja

 DISCLAIMER: NO WARRANTIES WHATSOEVER ARE PROVIDED WITH THIS
 PROGRAM. YOU RUN IT AT YOUR OWN RISK.
 YOU CAN DISTRIBUTE this utility pgm in other collaborations

 purpose: flip dosage if the alleles in original data are
 different when compared to 1000G set of nucmtDNA
 fix also the info file for the alleles

 Running:
 ========

 The program is run using Rscript, e.g.,

 Rscript strandflip.R study=mystudy stddosage=mystudy.dosage  stdinfo=mystudy.info dirout=/output/here ...

 This will run the program reading `mystudy.dosage` and `mystudy.info` from the current
 directory. It also expects to find the .csv files with reference data thay is 
 provided by Aldi here: https://dsgweb.wustl.edu/pleionet/chargemtdna/db/db.html

 Upon completion, the script produces three files per chromosome analyzed in `dirout`:
 `mystudy.chr??.info`: An updated INFO file  (flipped appropriately to match reference)
 `mystudy.chr??.dosage`: Updated dosage file (flipped appropriately to match reference) 
 `mystudy.chr??.mismatch`: File contained list of mismatched variants (discarded from further analysis) 

 Options:
 ========

 - startc= and endc= are chrom numbers of start and end ex.
  one can select startc=1, and endc=22, or one may select
   startc=22 and endc=22, etc. thus one may run one chrom
   at a time in different servers if it is possible
                                                            
 - study= is your study label                               

 - stddosage= is the study dosage tables for your study     
   rows are subjects and columns are markers in your study, 
   with one column as subject ID, content of a cell is dosage
                                                            
 - aref= is the reference allele of 1000G, as letter        
                                                            
 - aalt= is the alternative allele in 1000G, as letter      
                                                            
 - aaltaf= is allele frequency for ALT allele, in decimal   
                                                            
 - bcoded= is the coded allele in your study, as letter     
                                                            
 - bnoncoded= is the non-coded allele in your study, as letter 
                                                            
 - bcodedaf= is the allele freq in your study for coded al. 
                                                            
 - ambigSNPswitch=FALSE, if TRUE the module of flipping ambigous
   SNPs becomes active. One turns it on if you see in the   
   plot produced from this pgm of study allele freq against 
   1000G allele freq in a direction against diagonal

 - allowindelabbrev=FALSE, if TRUE then I/D alleles in your
   INFO file are matched against long/short alleles in the 
   reference data and retained/flipped if one reference allele
   is longer than the other. Only, if they are the same length
   (e.g., a SNP) then indel variant will be discarded and logged
   to a mismatched file.  
                                                            
 - delta=, a value of the difference between allele freq of 
   your study and 1000G that indicates that alleles have to 
   be flipped. For example, if the allele freq diff is >= 0.4
   this says that your study may have A=0.3 while 1000G A=0.7
   or your study A=0.2 and 1000G A=0.8 etc.                 
   Ambiguous SNPs are considered A and T , C and G where we 
   cannot distinguish which strand these alleles are coming 
   from.                                                    
                                                            
 - dirout= is a NEW directory to be CREATED 
   any time that you run this program.

                                                            
 Changelog
 =========

 Version 0.2 (Jan 4 2016)
 ========================

 Some issues were found in the first version of the strandflip.R program:

 > In the SAS program, things that do not match (after we have checked all other possibilities such 
 > as reverse strand, complement etc) based on alleles they go to a file named "missmatched", and 
 > they are not included in the final version of the data to be used for the associations.
 > > > - The same has to happen with the R program. 

 This now also happens in the R program: 
 1) A file "flipped100g/[study].[chromosome].mismatched" file is created.
 2) The info/dosage files only contains non-mismatched variants 

 > - Christian has used long labels concatenated with underscores that in different versions of R will
 > be misinterpreted: For example here is a partial strandflip.R program:
 > The underscore in R is also assignment. Please use a different way of distinguishing the 
 > combined words as follow: for "ref_and_info" you can write "refAndInfo"; for the "info <-
 > fields" you can translate it to "infoFields" etc.

 I think this is only a emacs related problem. All variables are now named with "." as word separator.

 > - The read.csv or read.table in strandflip.R have to be used carefully dependent on a study’s data 
 > format (TAB or CSV delimit). Christian, please provide a readme that emphasizes this idea.
 > - Christian, in your strandflip.R, in the hg1000nucmtdnab138_c#.csv instead of using ID 
 > variable, please refer to rs138, pos138, chrom138, REF, ALT, altaf, because within the same file 
 > referring to ID variable (which is the same as rs138) may confuse analysts who normally think 
 > for ID as subjectID.

 The script uses the rs138 id to identify variants from the reference files.
 It also assumes that the id-column of the INFO file is called rs138 (rather than ID).

 > Indel alleles may be given in single-character format, e.g., I/D for Insertion/Deletion, whereas
 > the reference gives the full nucleotide sequence. Obviously, I/D alleles cannot be complemented
 > and leads to errors from the program.

 The program now produces only a warning when an allele cannot be complemented. These I/D
 variants will be excluded in output INFO+dosage files as well as logged to the xxx.mismatched file.
 However, when special option "allowindelabbrev" the program will allow and possible flip these I/D, 
 by comparing with reference allele length.

 > Other improvements:
 - Bug fix: misspelled the ambigSwitch argument, so it effectively was not used.
 - Bug fix: Complementing complained because alleles were loaded as factors. Fixed by explicitly converting to character.
 - Slightly better plots with different kinds of points to delineate the action taken by the program
  for each individual variant.
 - Reference database files from Aldi does not need to be unpacked (but can be).

 Version 0.1 (20. Nov. 2015)
 ===========================

 First version of the script.
diff --git a/strandflip.R b/strandflip.R
 # 	strandflip.R
 # 	By Christian Theil Have
 #   with inspiration from SAS script by Aldi Kraja                                              
 # 
 # See README file for details on running the program.
 #                                                            
 # DISCLAIMER: NO WARRANTIES WHATSOEVER ARE PROVIDED WITH THIS
 # PROGRAM. YOU RUN IT AT YOUR OWN RISK: 11.10.2015           
 # YOU CAN DISTRIBUTE this utility pgm in other collaborations

 nucleotide.complement <- function(x) list("A"="T","T"="A","G"="C","C"="G")[[x]]
 compl <- function(seq) { 
 	try.compl <- do.call(paste0,lapply(strsplit(as.character(seq),"")[[1]],nucleotide.complement))
 	if (length(try.compl) == 1) {
 		try.compl
 	} else {
 		warning(paste("Sequence ", seq, " cannot be complemented"))
 	}
 }


 alleles.are.identical <- function(ref.a0, ref.a1, a0, a1) ref.a0 == a0 && ref.a1 == a1

 alleles.are.flipped <- function(ref.a0, ref.a1, a0, a1) ref.a0 == a1 && ref.a1 == a0

 indel.is.identical <- function(ref.a0, ref.a1, a0, a1)
 	((a0 == "I" && a1 == "D" && nchar(ref.a0) > nchar(ref.a1)) || (a0 == "D" && a1 == "I" && nchar(ref.a0) < nchar(ref.a1)))

 indel.is.flipped <- function(ref.a0, ref.a1, a0, a1)
 	(a0 == "I" && a1 == "D" && nchar(ref.a0) < nchar(ref.a1)) || (a0 == "D" && a1 == "I" && (nchar(ref.a0) > nchar(ref.a1)))

 alleles.are.compliment <- function(ref.a0, ref.a1, a0, a1) { print(paste(ref.a0,ref.a1,a0,a1,sep=":")); compl(ref.a0) == compl(a0) && ref.a1 == compl(a1) }

 alleles.are.compliment.and.flipped <- function(ref.a0, ref.a1, a0, a1) compl(ref.a0) == a1 && compl(ref.a1) == a0

 alleles.ambigous <- function(a0,a1) compl(a0)== a1 

 # Tries to read file as tab-delimited, if this fails -- fall back to comma separated
 read.delim.guess <- function(...) {
 	data<-read.table(...)
 	if (ncol(data)==1)
 		read.csv(...)
 	else
 		data
 }

 flip <- function(
 		startc=1, 
 		endc=22, 
 		study="test", 
 		dirin=getwd(),
 		stddosage="test.dosage",
 		stdinfo="test.info",
 		aid="rs138", 
 		aref="REF",
 		alt="ALT",
 		aaltaf="altaf",
 		bid="rs138",
 		bcoded="effectal",
 		bnoncoded="noneffectal",
 		bcodedaf="codedaf",
 		pos="pos",
 		ambigSNPswitch=FALSE,
 		allowindelabbrev=FALSE,
 		delta=0.4,
 		dirout=paste0(getwd(), "/flipped1000g/")) {

 	if (!file.exists(dirout))
 		system(paste("mkdir -p", dirout))

 	for(c in startc:endc) {
 		# load 1000G file (only column ID, REF, ALT, altaf)
 		ref.file = paste0(dirin, "/hg1000nucmtdnab138_c", c, ".csv") 
 		if (file.exists(paste0(ref.file, ".gz"))) # File unpacked or not?
 			ref.file <- gzfile(paste0(ref.file, ".gz"))
 		ref.1KG <- read.csv(ref.file,h=T,stringsAsFactors=T)[,c(aid,aref,alt,aaltaf)]
 		ref.1KG <- ref.1KG

 		# read info file
 		info <- read.delim.guess(stdinfo,h=T,stringsAsFactors=F)
 		info.fields <- c(bid,bcoded,bnoncoded,bcodedaf)

 		stopifnot(all(info.fields %in% colnames(info)))
 		info <- info[,info.fields]
 		info[[aid]] <- info[[bid]] # for merging purposes

 		# Assumes no name-overlap between bcoded,bnoncoded,bcodedaf and  aref,alt,aaltaf
 		ref.and.info <- merge(info, ref.1KG, by=aid)

 		ref.and.info$flip <- rep(FALSE, nrow(ref.and.info))
 		ref.and.info$origin <- rep(NA, nrow(ref.and.info))

 		# read dosages file
 		dosages <- read.delim.guess(stddosage,h=T)

 		for(snp in 1:nrow(ref.and.info)) {
 			a0 <- as.character(ref.and.info[snp,][[bnoncoded]])
 			a1 <- as.character(ref.and.info[snp,][[bcoded]])
 			ref.a0 <- as.character(ref.and.info[snp ,][[aref]])
 			ref.a1 <- as.character(ref.and.info[snp ,][[alt]])

 			frequency.mismatch = (abs(ref.and.info[[bcodedaf]]-ref.and.info[[aaltaf]]) > delta)

 			if (alleles.are.identical(ref.a0,ref.a1,a0,a1) || (allowindelabbrev && indel.is.identical(ref.a0, ref.a1, a0, a1))) {
 					ref.and.info[snp,]$origin <- "matched"
 			} else if ((allowindelabbrev && indel.is.flipped(ref.a0, ref.a1, a0, a1)) || alleles.are.flipped(ref.a0,ref.a1,a0,a1)) {
 					ref.and.info[snp,]$flip <- TRUE
 					ref.and.info[snp,]$origin <- "flipped"
 					dosages[, ref.and.info[snp,][[aid]] ]  <- 2 - dosages[, ref.and.info[snp,][[aid]] ] 
 			} else if (alleles.are.compliment(ref.a0,ref.a1,a0,a1)) {
 					ref.and.info[snp,]$origin <- "complement"
 			} else {
 					ref.and.info[snp,]$origin <- "mismatched"
 			}

 			if (ambigSNPswitch && alleles.are.ambigous(ref.a0,ref.a1,a0,a1) && frequency.mismatch) {
 					ref.and.info[snp,]$origin <- "ambigFlip"
 					dosages[, ref.and.info[snp,][[aid]] ]  <- 2 - dosages[, ref.and.info[snp,][[aid]] ] 
 			} 

 		}

 		origin.types <- c("matched", "flipped", "complement", "mismatched", "ambigFlip")
 		ref.and.info$origin <- factor(ref.and.info$origin, levels=origin.types)

 		new.codedaf <- ifelse(ref.and.info$flip, 1-ref.and.info[[bcodedaf]],ref.and.info[[bcodedaf]])
 		print(summary(ref.and.info))

 		par(mfrow=c(1,2))
 		pdf(paste0(dirout, "/", study, ".chr" , c, ".frequencies.pdf"))
 		plot(ref.and.info[[bcodedaf]],ref.and.info[[aaltaf]], main=paste(study, "chr", c, "before flip"), xlab="Study allele frequency", ylab="Reference allele frequency", pch=as.numeric(factor(ref.and.info$origin, levels=origin.types, labels=1:5)))
 		plot(new.codedaf,ref.and.info[[aaltaf]],main=paste(study, "chr", c, "after flip"), xlab="Study allele frequency", ylab="Reference allele frequency", 
 			pch=as.numeric(factor(ref.and.info$origin, levels=origin.types, labels=1:5)))
 		dev.off()

 		ref.and.info[[bcodedaf]] <- new.codedaf

 		good.info <- ref.and.info[!(ref.and.info$origin=="mismatched"),] 
 		good.dosages <- dosages[,which(colnames(dosages) %in% good.info[[aid]])]

 		print(paste("retained", nrow(good.info), "out of", nrow(ref.and.info), "variants"))

 		# Make updated/extended info file 
 		write.table(good.info,paste0(dirout,"/",study,".chr", c, ".info"),row.names=F, col.names=T, quot=F, sep="\t")
 	
 		# Make updated dosage file
 		write.table(good.dosages,paste0(dirout,"/",study,".chr", c, ".dosage"),row.names=T, col.names=T, quot=F,sep="\t")

 		# Write mismatched table 
 		if (sum(ref.and.info$origin == "mismatched") > 0) {
 			write.table(ref.and.info[ref.and.info$origin == "mismatched",]$ID, paste0(dirout, "/", study, ".chr" , c, ".mismatched"),row.names=F,col.names=F, quot=F)
 		}
 	}
 }

 ####################### Arguments #############################

 args <- commandArgs(trailingOnly=TRUE)
 opts <- sapply(strsplit(args,"="), function(x) { x[2] } )
 names(opts) <- sapply(strsplit(args,"="), function(x) { x[1] } )


 opts.list <- as.list(opts)

 for(opt in c("startc", "endc", "delta"))  
 	if(!is.null(opts.list[[opt]])) 
 		opts.list[[opt]] <- as.numeric(opts.list[[opt]])

 for(opt in c("ambigSNPswitch", "allowindelabbrev"))
 	if(!is.null(opts.list[[opt]])) 
 		opts.list[[opt]] <- as.logical(opts.list[[opt]])

 print(str(opts.list))

 flip.res <- do.call(flip,opts.list)
	strandflip.R version 0.2
	By Christian Theil Have
	with inspiration from SAS script by Aldi Kraja

	DISCLAIMER: NO WARRANTIES WHATSOEVER ARE PROVIDED WITH THIS
	PROGRAM. YOU RUN IT AT YOUR OWN RISK.
	YOU CAN DISTRIBUTE this utility pgm in other collaborations

	purpose: flip dosage if the alleles in original data are
	different when compared to 1000G set of nucmtDNA
	fix also the info file for the alleles

	Running:
	========

	The program is run using Rscript, e.g.,

	Rscript strandflip.R study=mystudy stddosage=mystudy.dosage stdinfo=mystudy.info dirout=/output/here ...

	This will run the program reading `mystudy.dosage` and `mystudy.info` from the current
	directory. It also expects to find the .csv files with reference data thay is
	provided by Aldi here: https://dsgweb.wustl.edu/pleionet/chargemtdna/db/db.html

	Upon completion, the script produces three files per chromosome analyzed in `dirout`:
	`mystudy.chr??.info`: An updated INFO file (flipped appropriately to match reference)
	`mystudy.chr??.dosage`: Updated dosage file (flipped appropriately to match reference)
	`mystudy.chr??.mismatch`: File contained list of mismatched variants (discarded from further analysis)

	Options:
	========

	- startc= and endc= are chrom numbers of start and end ex.
	one can select startc=1, and endc=22, or one may select
	startc=22 and endc=22, etc. thus one may run one chrom
	at a time in different servers if it is possible

	- study= is your study label

	- stddosage= is the study dosage tables for your study
	rows are subjects and columns are markers in your study,
	with one column as subject ID, content of a cell is dosage

	- aref= is the reference allele of 1000G, as letter

	- aalt= is the alternative allele in 1000G, as letter

	- aaltaf= is allele frequency for ALT allele, in decimal

	- bcoded= is the coded allele in your study, as letter

	- bnoncoded= is the non-coded allele in your study, as letter

	- bcodedaf= is the allele freq in your study for coded al.

	- ambigSNPswitch=FALSE, if TRUE the module of flipping ambigous
	SNPs becomes active. One turns it on if you see in the
	plot produced from this pgm of study allele freq against
	1000G allele freq in a direction against diagonal

	- allowindelabbrev=FALSE, if TRUE then I/D alleles in your
	INFO file are matched against long/short alleles in the
	reference data and retained/flipped if one reference allele
	is longer than the other. Only, if they are the same length
	(e.g., a SNP) then indel variant will be discarded and logged
	to a mismatched file.

	- delta=, a value of the difference between allele freq of
	your study and 1000G that indicates that alleles have to
	be flipped. For example, if the allele freq diff is >= 0.4
	this says that your study may have A=0.3 while 1000G A=0.7
	or your study A=0.2 and 1000G A=0.8 etc.
	Ambiguous SNPs are considered A and T , C and G where we
	cannot distinguish which strand these alleles are coming
	from.

	- dirout= is a NEW directory to be CREATED
	any time that you run this program.


	Changelog
	=========

	Version 0.2 (Jan 4 2016)
	========================

	Some issues were found in the first version of the strandflip.R program:

	> In the SAS program, things that do not match (after we have checked all other possibilities such
	> as reverse strand, complement etc) based on alleles they go to a file named "missmatched", and
	> they are not included in the final version of the data to be used for the associations.
	> > > - The same has to happen with the R program.

	This now also happens in the R program:
	1) A file "flipped100g/[study].[chromosome].mismatched" file is created.
	2) The info/dosage files only contains non-mismatched variants

	> - Christian has used long labels concatenated with underscores that in different versions of R will
	> be misinterpreted: For example here is a partial strandflip.R program:
	> The underscore in R is also assignment. Please use a different way of distinguishing the
	> combined words as follow: for "ref_and_info" you can write "refAndInfo"; for the "info <-
	> fields" you can translate it to "infoFields" etc.

	I think this is only a emacs related problem. All variables are now named with "." as word separator.

	> - The read.csv or read.table in strandflip.R have to be used carefully dependent on a study’s data
	> format (TAB or CSV delimit). Christian, please provide a readme that emphasizes this idea.
	> - Christian, in your strandflip.R, in the hg1000nucmtdnab138_c#.csv instead of using ID
	> variable, please refer to rs138, pos138, chrom138, REF, ALT, altaf, because within the same file
	> referring to ID variable (which is the same as rs138) may confuse analysts who normally think
	> for ID as subjectID.

	The script uses the rs138 id to identify variants from the reference files.
	It also assumes that the id-column of the INFO file is called rs138 (rather than ID).

	> Indel alleles may be given in single-character format, e.g., I/D for Insertion/Deletion, whereas
	> the reference gives the full nucleotide sequence. Obviously, I/D alleles cannot be complemented
	> and leads to errors from the program.

	The program now produces only a warning when an allele cannot be complemented. These I/D
	variants will be excluded in output INFO+dosage files as well as logged to the xxx.mismatched file.
	However, when special option "allowindelabbrev" the program will allow and possible flip these I/D,
	by comparing with reference allele length.

	> Other improvements:
	- Bug fix: misspelled the ambigSwitch argument, so it effectively was not used.
	- Bug fix: Complementing complained because alleles were loaded as factors. Fixed by explicitly converting to character.
	- Slightly better plots with different kinds of points to delineate the action taken by the program
	for each individual variant.
	- Reference database files from Aldi does not need to be unpacked (but can be).

	Version 0.1 (20. Nov. 2015)
	===========================

	First version of the script.
	# strandflip.R
	# By Christian Theil Have
	# with inspiration from SAS script by Aldi Kraja
	#
	# See README file for details on running the program.
	#
	# DISCLAIMER: NO WARRANTIES WHATSOEVER ARE PROVIDED WITH THIS
	# PROGRAM. YOU RUN IT AT YOUR OWN RISK: 11.10.2015
	# YOU CAN DISTRIBUTE this utility pgm in other collaborations

	nucleotide.complement <- function(x) list("A"="T","T"="A","G"="C","C"="G")[[x]]
	compl <- function(seq) {
	try.compl <- do.call(paste0,lapply(strsplit(as.character(seq),"")[[1]],nucleotide.complement))
	if (length(try.compl) == 1) {
	try.compl
	} else {
	warning(paste("Sequence ", seq, " cannot be complemented"))
	}
	}


	alleles.are.identical <- function(ref.a0, ref.a1, a0, a1) ref.a0 == a0 && ref.a1 == a1

	alleles.are.flipped <- function(ref.a0, ref.a1, a0, a1) ref.a0 == a1 && ref.a1 == a0

	indel.is.identical <- function(ref.a0, ref.a1, a0, a1)
	((a0 == "I" && a1 == "D" && nchar(ref.a0) > nchar(ref.a1)) \|\| (a0 == "D" && a1 == "I" && nchar(ref.a0) < nchar(ref.a1)))

	indel.is.flipped <- function(ref.a0, ref.a1, a0, a1)
	(a0 == "I" && a1 == "D" && nchar(ref.a0) < nchar(ref.a1)) \|\| (a0 == "D" && a1 == "I" && (nchar(ref.a0) > nchar(ref.a1)))

	alleles.are.compliment <- function(ref.a0, ref.a1, a0, a1) { print(paste(ref.a0,ref.a1,a0,a1,sep=":")); compl(ref.a0) == compl(a0) && ref.a1 == compl(a1) }

	alleles.are.compliment.and.flipped <- function(ref.a0, ref.a1, a0, a1) compl(ref.a0) == a1 && compl(ref.a1) == a0

	alleles.ambigous <- function(a0,a1) compl(a0)== a1

	# Tries to read file as tab-delimited, if this fails -- fall back to comma separated
	read.delim.guess <- function(...) {
	data<-read.table(...)
	if (ncol(data)==1)
	read.csv(...)
	else
	data
	}

	flip <- function(
	startc=1,
	endc=22,
	study="test",
	dirin=getwd(),
	stddosage="test.dosage",
	stdinfo="test.info",
	aid="rs138",
	aref="REF",
	alt="ALT",
	aaltaf="altaf",
	bid="rs138",
	bcoded="effectal",
	bnoncoded="noneffectal",
	bcodedaf="codedaf",
	pos="pos",
	ambigSNPswitch=FALSE,
	allowindelabbrev=FALSE,
	delta=0.4,
	dirout=paste0(getwd(), "/flipped1000g/")) {

	if (!file.exists(dirout))
	system(paste("mkdir -p", dirout))

	for(c in startc:endc) {
	# load 1000G file (only column ID, REF, ALT, altaf)
	ref.file = paste0(dirin, "/hg1000nucmtdnab138_c", c, ".csv")
	if (file.exists(paste0(ref.file, ".gz"))) # File unpacked or not?
	ref.file <- gzfile(paste0(ref.file, ".gz"))
	ref.1KG <- read.csv(ref.file,h=T,stringsAsFactors=T)[,c(aid,aref,alt,aaltaf)]
	ref.1KG <- ref.1KG

	# read info file
	info <- read.delim.guess(stdinfo,h=T,stringsAsFactors=F)
	info.fields <- c(bid,bcoded,bnoncoded,bcodedaf)

	stopifnot(all(info.fields %in% colnames(info)))
	info <- info[,info.fields]
	info[[aid]] <- info[[bid]] # for merging purposes

	# Assumes no name-overlap between bcoded,bnoncoded,bcodedaf and aref,alt,aaltaf
	ref.and.info <- merge(info, ref.1KG, by=aid)

	ref.and.info$flip <- rep(FALSE, nrow(ref.and.info))
	ref.and.info$origin <- rep(NA, nrow(ref.and.info))

	# read dosages file
	dosages <- read.delim.guess(stddosage,h=T)

	for(snp in 1:nrow(ref.and.info)) {
	a0 <- as.character(ref.and.info[snp,][[bnoncoded]])
	a1 <- as.character(ref.and.info[snp,][[bcoded]])
	ref.a0 <- as.character(ref.and.info[snp ,][[aref]])
	ref.a1 <- as.character(ref.and.info[snp ,][[alt]])

	frequency.mismatch = (abs(ref.and.info[[bcodedaf]]-ref.and.info[[aaltaf]]) > delta)

	if (alleles.are.identical(ref.a0,ref.a1,a0,a1) \|\| (allowindelabbrev && indel.is.identical(ref.a0, ref.a1, a0, a1))) {
	ref.and.info[snp,]$origin <- "matched"
	} else if ((allowindelabbrev && indel.is.flipped(ref.a0, ref.a1, a0, a1)) \|\| alleles.are.flipped(ref.a0,ref.a1,a0,a1)) {
	ref.and.info[snp,]$flip <- TRUE
	ref.and.info[snp,]$origin <- "flipped"
	dosages[, ref.and.info[snp,][[aid]] ] <- 2 - dosages[, ref.and.info[snp,][[aid]] ]
	} else if (alleles.are.compliment(ref.a0,ref.a1,a0,a1)) {
	ref.and.info[snp,]$origin <- "complement"
	} else {
	ref.and.info[snp,]$origin <- "mismatched"
	}

	if (ambigSNPswitch && alleles.are.ambigous(ref.a0,ref.a1,a0,a1) && frequency.mismatch) {
	ref.and.info[snp,]$origin <- "ambigFlip"
	dosages[, ref.and.info[snp,][[aid]] ] <- 2 - dosages[, ref.and.info[snp,][[aid]] ]
	}

	}

	origin.types <- c("matched", "flipped", "complement", "mismatched", "ambigFlip")
	ref.and.info$origin <- factor(ref.and.info$origin, levels=origin.types)

	new.codedaf <- ifelse(ref.and.info$flip, 1-ref.and.info[[bcodedaf]],ref.and.info[[bcodedaf]])
	print(summary(ref.and.info))

	par(mfrow=c(1,2))
	pdf(paste0(dirout, "/", study, ".chr" , c, ".frequencies.pdf"))
	plot(ref.and.info[[bcodedaf]],ref.and.info[[aaltaf]], main=paste(study, "chr", c, "before flip"), xlab="Study allele frequency", ylab="Reference allele frequency", pch=as.numeric(factor(ref.and.info$origin, levels=origin.types, labels=1:5)))
	plot(new.codedaf,ref.and.info[[aaltaf]],main=paste(study, "chr", c, "after flip"), xlab="Study allele frequency", ylab="Reference allele frequency",
	pch=as.numeric(factor(ref.and.info$origin, levels=origin.types, labels=1:5)))
	dev.off()

	ref.and.info[[bcodedaf]] <- new.codedaf

	good.info <- ref.and.info[!(ref.and.info$origin=="mismatched"),]
	good.dosages <- dosages[,which(colnames(dosages) %in% good.info[[aid]])]

	print(paste("retained", nrow(good.info), "out of", nrow(ref.and.info), "variants"))

	# Make updated/extended info file
	write.table(good.info,paste0(dirout,"/",study,".chr", c, ".info"),row.names=F, col.names=T, quot=F, sep="\t")

	# Make updated dosage file
	write.table(good.dosages,paste0(dirout,"/",study,".chr", c, ".dosage"),row.names=T, col.names=T, quot=F,sep="\t")

	# Write mismatched table
	if (sum(ref.and.info$origin == "mismatched") > 0) {
	write.table(ref.and.info[ref.and.info$origin == "mismatched",]$ID, paste0(dirout, "/", study, ".chr" , c, ".mismatched"),row.names=F,col.names=F, quot=F)
	}
	}
	}

	####################### Arguments #############################

	args <- commandArgs(trailingOnly=TRUE)
	opts <- sapply(strsplit(args,"="), function(x) { x[2] } )
	names(opts) <- sapply(strsplit(args,"="), function(x) { x[1] } )


	opts.list <- as.list(opts)

	for(opt in c("startc", "endc", "delta"))
	if(!is.null(opts.list[[opt]]))
	opts.list[[opt]] <- as.numeric(opts.list[[opt]])

	for(opt in c("ambigSNPswitch", "allowindelabbrev"))
	if(!is.null(opts.list[[opt]]))
	opts.list[[opt]] <- as.logical(opts.list[[opt]])

	print(str(opts.list))

	flip.res <- do.call(flip,opts.list)