Created
December 14, 2017 23:37
-
-
Save reedacartwright/334fc1b1621f6bd2abe699ffc3736bc5 to your computer and use it in GitHub Desktop.
Script to calculate the entropy of a mutation specturm from a vcf
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# This script calculates the average entropy and effective number of | |
# possible mutants for the human mutation spectrum. | |
# Preparing to run this script | |
# > wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/VCF/common_all_20170710.vcf.gz | |
# > wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/VCF/common_all_20170710.vcf.gz.tbi | |
# > bcftools query -f '%REF\t%ALT\n' common_all_20170710.vcf.gz | zstd > altref.txt.zst | |
library(data.table) | |
library(stringr) | |
aa = fread("zstdcat altref.txt.zst", header=FALSE) | |
names(aa) = c("REF", "ALT") | |
a = aa[!str_detect(aa$ALT,","),] | |
a[,Nuc := str_sub(REF,1,1)] | |
a[,Masked := str_replace_all(REF,".","N")] | |
# Due to the way VCF records indels, we will split the input based on the first | |
# character of the reference, and mask the reference string with Ns. | |
b = split(str_c(a$Masked,a$ALT,sep="/"), a$Nuc) | |
# estimate the entropy of each subset | |
x = lapply(b, table) | |
p = lapply(x, function(y) y/sum(y)) | |
h = sapply(p, function(q) -sum(q*log2(q))) | |
# take the average over each subset | |
n = sapply(b,length) | |
hEntropy = weighted.mean(h,n) # 1.998921 | |
kEffective = 2**hEntropy # 3.998008 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment