-
-
Save sirselim/dcaad07523c90b46c1c0685efbc5d04e to your computer and use it in GitHub Desktop.
#!/bin/bash | |
## small bash script to download and reformat dbNSFP for pipeline | |
## Miles Benton | |
## created: 2018-01-13 | |
## modified: 2019-08-21 | |
# Set to dbNSFP version to download and build | |
version="4.0a" | |
#TODO: add an option to 'scrape' this from the url to always return latest version | |
# define thread number for parallel processing where able | |
THREADS=6 | |
# Download dbNSFP database | |
wget -O dbNSFP${version}.zip "ftp://dbnsfp:[email protected]/dbNSFP${version}.zip" | |
# Uncompress | |
unzip dbNSFP${version}.zip | |
# grab header | |
zcat dbNSFP${version}_variant.chr1.gz | head -n 1 | bgzip > header.gz | |
### this section will produce data for hg38 capable pipelines | |
## hg38 version | |
# Create a single file version | |
# NOTE: bgzip parameter -@ X represents number of threads | |
cat dbNSFP${version}_variant.chr{1..22}.gz dbNSFP${version}_variant.chrX.gz dbNSFP${version}_variant.chrY.gz dbNSFP${version}_variant.chrM.gz | zgrep -v '#chr' | bgzip -@ ${THREADS} > dbNSFPv${version}_custom.gz | |
#TODO: there must be a 'nicer' way of ordering the input into the cat (to include the X,Y and M chrs without explicitly stating them) | |
# add header back into file | |
cat header.gz dbNSFPv${version}_custom.gz > dbNSFPv${version}_custombuild.gz | |
# Create tabix index | |
tabix -s 1 -b 2 -e 2 dbNSFPv${version}_custombuild.gz | |
# test annotation | |
# java -jar ~/install/snpEff/SnpSift.jar dbnsfp -v -db /mnt/dbNSFP/hg19/dbNSFPv${version}_custombuild.gz test/chr1_test.vcf > test/chr1_test_anno.vcf | |
#TODO: provide actual unit test files for testing purposes, i.e. a section of public data with known annotation rates. | |
#TODO: the above is currently a placeholder but it had it's intended purpose in terms of identifying incorrect genome build. | |
# clean up | |
#TODO: add clean up step to rm all intermediate files after testing confirmed working (i.e. correct annotation 'rates') | |
#/END hg38 | |
### | |
### this section will produce data for hg19 capable pipelines | |
## hg19 version | |
# for hg19 (coordinate data is located in columns 8 [chr] and 9 [position]) | |
# this takes output from above, filters out any variants with no hg19 coords and then sorts on hg19 chr and position, and then bgzips output | |
# NOTE: bgzip parameter -@ X represents number of threads | |
# create a file with ordered chrosome names | |
echo {1..22} X Y M | tr ' ' '\n' > chromosomeList.txt | |
# set awk params as variable for parallel | |
awkBody1='$8 != "."' | |
awkBody2='BEGIN{FS=OFS="\t"} {t = $2; $2 = $9; $9 = t; x = $1; $1 = $8; $8 = $1; print;}' | |
# parallel implementation of the hg19 format and sort | |
parallel -j 4 \ | |
"zgrep -v '#chr' dbNSFP${version}_variant.chr{}.gz | \ | |
awk '${awkBody1}' | \ | |
awk '${awkBody2}' | \ | |
LC_ALL=C sort --parallel=2 -n -S 1G -T . -k 1,1 -k 2,2 --compress-program=gzip | \ | |
bgzip -@ 2 > dbNSFPv${version}.chr{}.hg19.custombuild.gz" :::: chromosomeList.txt | |
#TODO: need to implement a defined approach to GNU parallel, can't use $THREADS | |
# cat all files back together (should retain sort order) | |
cat header.gz dbNSFPv${version}.chr{1..22}.hg19.custombuild.gz dbNSFPv${version}.chrX.hg19.custombuild.gz dbNSFPv${version}.chrY.hg19.custombuild.gz dbNSFPv${version}.chrM.hg19.custombuild.gz > dbNSFPv${version}.hg19.custombuild.gz | |
# Create tabix index | |
tabix -s 1 -b 2 -e 2 dbNSFPv${version}.hg19.custombuild.gz | |
# test hg19 annotation | |
# java -jar ~/install/snpEff/SnpSift.jar dbnsfp -v -db /mnt/dbNSFP/hg19/dbNSFPv${version}.hg19.custombuild.gz test/chr1_test.vcf > test/chr1_test_anno.vcf | |
# clean up | |
#TODO: add clean up step to rm all intermediate files after testing confirmed working (i.e. correct annotation 'rates') | |
#/END hg38 | |
### |
The problem that @robinpalvadeau is having is because there are some "bad" lines in chr6, chr17 and chr22 that when they are filtered by awk to convert to hg19 format, seems to belong to another chr. For example, this line from chr6 that is in chr6:
6 106971480 T A X C . Y 21154172 Y 19613560 38 CD24 ENSG00000272398 ENST00000620405 ENSP00000478302
is converted to that seems to be in chrY:
Y 21154172 T A X C . ---cont ---
I am not sure why these lines are there, but I think is better to filter them out with a grep in the parallel filter, this way:
# parallel implementation of the hg19 format and sort
parallel -j 4 \
"zgrep -v '#chr' dbNSFP${version}_variant.chr{}.gz | \
awk '${awkBody1}' | \
awk '${awkBody2}' | grep '^{}' | \
LC_ALL=C sort --parallel=2 -n -S 1G -T . -k 1,1 -k 2,2 --compress-program=gzip | \
bgzip -@ 2 > dbNSFPv${version}.chr{}.hg19.custombuild.gz" :::: chromosomeList.txt
#TODO: need to implement a defined approach to GNU parallel, can't use $THREADS
By the way, I also think that there is a mistake in this line (it is not important because $8 column is not used later):
awkBody2='BEGIN{FS=OFS="\t"} {t = $2; $2 = $9; $9 = t; x = $1; $1 = $8; $8 = $1; print;}'
that should be:
awkBody2='BEGIN{FS=OFS="\t"} {t = $2; $2 = $9; $9 = t; x = $1; $1 = $8; $8 = x; print;}'
The problem that @robinpalvadeau is having is because there are some "bad" lines in chr6, chr17 and chr22 that when they are filtered by awk to convert to hg19 format, seems to belong to another chr. For example, this line from chr6 that is in chr6:
I mentioned these variants in my response above, they are not 'bad' variants and I think it is dangerous to assume such, thus I don't remove them. That's why the creation of dbNSFP hg19 is a little trickier, because you need to identify the variants that map to different chromsomes between the builds, move them to the correct location and then apply the sorting. I have this all written up, but it unfortunately takes what i consider to be too much RAM to do it efficiently. So my current solution is to build the most recent version of dbNSFP once (for both hg19 and hg38) on our cluster and then distribute across other machines/locations as required.
What I'm doing in the snippets you have highlighted is moving the columns into the correct location (the awk script you mention). By default dbNSFP uses hg38 positions in columns 1 and 2, and then has hg19 positions in columns 8 and 9. If you haven't (and are interested) I suggest you familiarise yourself with the dbNSFP structure (most recent readme: https://drive.google.com/file/d/1HPcIEE1USiwCPzRFFwZhTZagka1uyUV1/view):
...
Columns of dbNSFP_variant:
1 chr: chromosome number
2 pos(1-based): physical position on the chromosome as to hg38 (1-based coordinate).
For mitochondrial SNV, this position refers to the rCRS (GenBank: NC_012920).
3 ref: reference nucleotide allele (as on the + strand)
4 alt: alternative nucleotide allele (as on the + strand)
5 aaref: reference amino acid
"." if the variant is a splicing site SNP (2bp on each end of an intron)
6 aaalt: alternative amino acid
"." if the variant is a splicing site SNP (2bp on each end of an intron)
7 rs_dbSNP151: rs number from dbSNP 151
8 hg19_chr: chromosome as to hg19, "." means missing
9 hg19_pos(1-based): physical position on the chromosome as to hg19 (1-based coordinate).
For mitochondrial SNV, this position refers to a YRI sequence (GenBank: AF347015)
10 hg18_chr: chromosome as to hg18, "." means missing
11 hg18_pos(1-based): physical position on the chromosome as to hg18 (1-based coordinate)
For mitochondrial SNV, this position refers to a YRI sequence (GenBank: AF347015)
...
I need to revisit this little side project at some stage, but as it's currently implemented it is working for our purposes.
@sirselim GATK doesn't give me a problem because of hg19.fa the reference file and the bam file are coherent with each other. Yes, you said it. Btw, I use snpSift too for splitting. :) Thank you!