Skip to content

Instantly share code, notes, and snippets.

@GuangchuangYu
Last active January 4, 2017 07:32
Show Gist options
  • Save GuangchuangYu/91b3396c7e49ab42c565a9cda3c35e18 to your computer and use it in GitHub Desktop.
Save GuangchuangYu/91b3396c7e49ab42c565a9cda3c35e18 to your computer and use it in GitHub Desktop.

No update since 2014.

## Access date: 2017-01-04
$ svn log -q |grep 'z.li'
r87229 | z.li | 2014-03-08 12:13:28 +0800 (Sat, 08 Mar 2014)
r85532 | z.li | 2014-01-15 13:00:43 +0800 (Wed, 15 Jan 2014)
r85454 | z.li | 2014-01-12 13:57:16 +0800 (Sun, 12 Jan 2014)
r72578 | z.li | 2013-01-16 15:11:20 +0800 (Wed, 16 Jan 2013)
r72577 | z.li | 2013-01-16 13:43:05 +0800 (Wed, 16 Jan 2013)
r69358 | z.li | 2012-09-12 15:02:36 +0800 (Wed, 12 Sep 2012)
r68440 | z.li | 2012-08-14 15:56:16 +0800 (Tue, 14 Aug 2012)
r68439 | z.li | 2012-08-14 15:42:45 +0800 (Tue, 14 Aug 2012)

too slow and wrong operation

https://github.com/Bioconductor-mirror/CorMut/blob/release-3.4/R/seqFormat.R#L8

	seq_f01=as.list(sapply(allSeq,function(x)c2s(s2c(x)[ref01!="-"])))

Use gsub will significantly speedup this operation.

Deleting - will convert the sequences from aligned to un-aligned, this is critical, as the package require aligned sequences, https://github.com/Bioconductor-mirror/CorMut/blob/release-3.4/R/seqFormat.R#L2: format=c("clustal","fasta","mase","phylip","msf"), and calculate differences based on codon (codon position will no more aligned after simply deleting -).

wrong calculation of codon difference

Besides with previous issus, each codon was calcuated 3 times.

The count of different codon will be real_diff * 3.

https://github.com/Bioconductor-mirror/CorMut/blob/master/R/CorMut-internal.R#L126-L129

tricky thing

randomly choose a base to substitute ambiguous site

wrong p-value for LOD

https://github.com/Bioconductor-mirror/CorMut/blob/release-3.4/R/kaksAA.R#L67-L77:


			LOD=0
			Nnum=NS+NY
			for(ii in NYs:Nnum){
				per_item=choose(Nnum,ii)*(q^ii)*((1-q)^(Nnum-ii))
				#avoid the appear of NaN
				if(per_item!="NaN"){
					LOD=per_item+LOD
				}
				#print(choose(dim(mat)[2],i)*(q^i)*((1-q)^(dim(mat)[2]-i)))
			}
			LOD=-log10(LOD)

there will be precision issue for:

per_item=choose(Nnum,ii)*(q^ii)*((1-q)^(Nnum-ii))

and NaN will be generated very often.

the author just throw them away by:

if(per_item!="NaN"){
   LOD=per_item+LOD
}

He should use pbinom to calculate the p-value!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment