@sirselim
Last active November 1, 2023 17:08
small bash script to download dbNSFP 'database' and wrangle into format for pipeline annotation process
#!/bin/bash
## small bash script to download and reformat dbNSFP for pipeline
## Miles Benton
## created: 2018-01-13
## modified: 2019-08-21
# Set the dbNSFP version to download and build
version="4.0a"
#TODO: add an option to 'scrape' this from the url to always return latest version
# define thread number for parallel processing where able
THREADS=6
# Download dbNSFP database
wget -O dbNSFP${version}.zip "ftp://dbnsfp:[email protected]/dbNSFP${version}.zip"
# Uncompress
unzip dbNSFP${version}.zip
# grab header
zcat dbNSFP${version}_variant.chr1.gz | head -n 1 | bgzip > header.gz
### this section will produce data for hg38 capable pipelines
## hg38 version
# Create a single file version
# NOTE: bgzip parameter -@ X represents number of threads
cat dbNSFP${version}_variant.chr{1..22}.gz dbNSFP${version}_variant.chrX.gz dbNSFP${version}_variant.chrY.gz dbNSFP${version}_variant.chrM.gz | zgrep -v '#chr' | bgzip -@ ${THREADS} > dbNSFPv${version}_custom.gz
#TODO: there must be a 'nicer' way of ordering the input into the cat (to include the X,Y and M chrs without explicitly stating them)
# add header back into file
cat header.gz dbNSFPv${version}_custom.gz > dbNSFPv${version}_custombuild.gz
# Create tabix index
tabix -s 1 -b 2 -e 2 dbNSFPv${version}_custombuild.gz
# test annotation
# java -jar ~/install/snpEff/SnpSift.jar dbnsfp -v -db /mnt/dbNSFP/hg19/dbNSFPv${version}_custombuild.gz test/chr1_test.vcf > test/chr1_test_anno.vcf
#TODO: provide actual unit test files for testing purposes, i.e. a section of public data with known annotation rates.
#TODO: the above is currently a placeholder, but it served its intended purpose of identifying an incorrect genome build.
# clean up
#TODO: add clean up step to rm all intermediate files after testing confirmed working (i.e. correct annotation 'rates')
#/END hg38
###
### this section will produce data for hg19 capable pipelines
## hg19 version
# for hg19 (coordinate data is located in columns 8 [chr] and 9 [position])
# this takes output from above, filters out any variants with no hg19 coords and then sorts on hg19 chr and position, and then bgzips output
# NOTE: bgzip parameter -@ X represents number of threads
# create a file with ordered chromosome names
echo {1..22} X Y M | tr ' ' '\n' > chromosomeList.txt
# set awk params as variable for parallel
awkBody1='$8 != "."'
awkBody2='BEGIN{FS=OFS="\t"} {t = $2; $2 = $9; $9 = t; x = $1; $1 = $8; $8 = $1; print;}'
# parallel implementation of the hg19 format and sort
parallel -j 4 \
"zgrep -v '#chr' dbNSFP${version}_variant.chr{}.gz | \
awk '${awkBody1}' | \
awk '${awkBody2}' | \
LC_ALL=C sort --parallel=2 -n -S 1G -T . -k 1,1 -k 2,2 --compress-program=gzip | \
bgzip -@ 2 > dbNSFPv${version}.chr{}.hg19.custombuild.gz" :::: chromosomeList.txt
#TODO: need to implement a defined approach to GNU parallel, can't use $THREADS
# cat all files back together (should retain sort order)
cat header.gz dbNSFPv${version}.chr{1..22}.hg19.custombuild.gz dbNSFPv${version}.chrX.hg19.custombuild.gz dbNSFPv${version}.chrY.hg19.custombuild.gz dbNSFPv${version}.chrM.hg19.custombuild.gz > dbNSFPv${version}.hg19.custombuild.gz
# Create tabix index
tabix -s 1 -b 2 -e 2 dbNSFPv${version}.hg19.custombuild.gz
# test hg19 annotation
# java -jar ~/install/snpEff/SnpSift.jar dbnsfp -v -db /mnt/dbNSFP/hg19/dbNSFPv${version}.hg19.custombuild.gz test/chr1_test.vcf > test/chr1_test_anno.vcf
# clean up
#TODO: add clean up step to rm all intermediate files after testing confirmed working (i.e. correct annotation 'rates')
#/END hg19
###
@sirselim (Author) commented Nov 8, 2019

Hi there. Yes, I was able to build the dbNSFP file and annotate using the latest version (4.0a). We are currently working on automatically generating the formatted file and then hosting it for public download. This will run whenever there is a new release of dbNSFP.

What problem are you having in particular?

@robinpalvadeau

Hi @sirselim,

Thanks for the script. However, I am facing a problem while generating the tabix index.

version=4.0c
tabix -s 1 -b 2 -e 2 dbNSFPv${version}.hg19.custombuild.gz
[E::hts_idx_push] Chromosome blocks not continuous
tbx_index_build failed: dbNSFPv4.0c.hg19.custombuild.gz

Isn't it strange? Because the file has been merged as:
cat header.gz dbNSFPv${version}.chr{1..22}.hg19.custombuild.gz dbNSFPv${version}.chrX.hg19.custombuild.gz dbNSFPv${version}.chrY.hg19.custombuild.gz dbNSFPv${version}.chrM.hg19.custombuild.gz > dbNSFPv${version}.hg19.custombuild.gz
Have you tried with the 4.0c version?
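One way to narrow down the `Chromosome blocks not continuous` error is to check whether any chromosome's rows are interleaved with another's in the merged file. Below is a minimal sketch on fabricated two-column data (chr, pos) standing in for real dbNSFP output; the file names are hypothetical:

```shell
# Hypothetical (chr, pos) file in which chr1 rows are split by other
# chromosomes -- exactly the condition that makes tabix refuse to index.
printf '1\t100\n1\t200\n2\t50\nY\t10\n1\t300\n' > demo.tsv

# Flag any chromosome that reappears after a different chromosome started.
awk -F'\t' '$1 != prev { if (seen[$1]++) print "chromosome " $1 " not contiguous at line " NR; prev = $1 }' demo.tsv > report.txt
cat report.txt
```

On the real file you could pipe `zcat dbNSFPv${version}.hg19.custombuild.gz` into the same awk one-liner to see which chromosome breaks the ordering.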

@misrak

misrak commented Dec 6, 2019

Hi,
I didn't combine them into one file. I kept them separate by chromosome, built the tabix index for each chromosome individually, and it worked; that also let me distribute the work over different nodes. It is strange, though. I think the chromosome order should be consistent with your VCF file.
The VCF file generated by the pipeline we use comes out like this: 1,2,3,4,5,6,7,X,8,9,10,11,12,13,14,15,16,17,18,20,Y,19,22,21,M. We had to modify other databases similarly. Hopefully this helps.
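If a database does need to be reordered to match a particular chromosome sequence, one option is to sort on an explicit rank map rather than lexically. A small sketch with a hypothetical unordered chromosome list, targeting the standard 1..22, X, Y, M order the script above expects:

```shell
# Hypothetical unordered chromosome list.
printf '7\nX\n2\nM\nY\n10\n' > chrs.txt

# Assign each chromosome a numeric rank, sort on the rank, drop the rank.
awk 'BEGIN { for (i = 1; i <= 22; i++) rank[i] = i; rank["X"] = 23; rank["Y"] = 24; rank["M"] = 25 }
     { print rank[$1] "\t" $1 }' chrs.txt | sort -k1,1n | cut -f2 > ordered.txt
cat ordered.txt
```

The same rank map could be reused on a data file by printing `rank[$1]` as a leading sort key instead of the bare chromosome name.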

@sirselim (Author)

Hi @robinpalvadeau

I haven't tried the commercial version (4.0c), so I'm not sure whether an issue is introduced there or not. I know the current 4.0a version seems to work OK. What I think your issue is: current versions of this 'database' are provided as hg38 only, hence why I had to move some columns around. I need to add some changes to the above script to handle the fact that some SNPs are actually on different chromosomes between the two builds (i.e. a SNP might be on chr6 in hg38 but on chr14 in hg19), so there has to be an additional sorting step. I have made those changes in a more complete version of this script and just need to find the time to reflect them back here. I would suggest debugging your sort order based on the hg19 chr and position columns and ensuring that they are in the correct order (more on this below).

@misrak as far as I'm aware the sorting of the resultant file is not dependent on the sort order of the VCF (the VCF is a downstream file, so that doesn't make much sense...). The resultant dbNSFP output is required to be in sorted numerical order, the 'standard' order of chromosomes, i.e. 1:22, X, Y, M. I personally leave this large file intact and then use SnpSift to split my VCF file(s) by chromosome, do my annotation in parallel across 25 threads, and then merge them back into a single VCF again. I find that this is actually much faster than deploying across different nodes, but obviously your mileage may vary.

  • as a side note: I'm actually surprised that you don't have downstream issues in your pipeline with the VCF order you mentioned; I know GATK wouldn't like it, right? But I guess if that's how your reference file is sorted then things would probably work out...
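The split/annotate/merge pattern described above can be sketched roughly as follows. This is a hedged outline on toy data: `demo.vcf` is fabricated, and an awk one-liner stands in for the real SnpSift annotation call.

```shell
# Toy VCF: one header line plus three records on different chromosomes.
printf '#CHROM\tPOS\n1\t100\n2\t200\nX\t300\n' > demo.vcf
mkdir -p split

# Split by chromosome, keeping the header in each chunk.
for chr in 1 2 X; do
  awk -v c="$chr" 'BEGIN{FS=OFS="\t"} /^#/ || $1 == c' demo.vcf > "split/chr${chr}.vcf"
done

# "Annotate" each chunk in parallel (placeholder for the SnpSift step:
# here we just append a tag column to every data row).
for chr in 1 2 X; do
  awk 'BEGIN{FS=OFS="\t"} !/^#/ { $0 = $0 OFS "ANNOTATED" } { print }' \
    "split/chr${chr}.vcf" > "split/chr${chr}.anno.vcf" &
done
wait

# Merge back into a single VCF, keeping one copy of the header.
{ grep '^#' split/chr1.anno.vcf
  grep -hv '^#' split/chr1.anno.vcf split/chr2.anno.vcf split/chrX.anno.vcf
} > merged.vcf
```

Because each chunk is processed independently, the backgrounded loop scales to however many threads (or chromosomes) are available.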

@misrak

misrak commented Dec 13, 2019

@sirselim GATK doesn't give me a problem because the reference file (hg19.fa) and the BAM file are consistent with each other. Yes, you said it. Btw, I use SnpSift too for splitting. :) Thank you!

@dariogf

dariogf commented Feb 27, 2020

The problem that @robinpalvadeau is having is because there are some "bad" lines in chr6, chr17 and chr22 that, when filtered by awk to convert to hg19 format, seem to belong to another chromosome. For example, this line from the chr6 file:

6       106971480       T       A       X       C       .       Y       21154172        Y       19613560        38      CD24    ENSG00000272398 ENST00000620405 ENSP00000478302

is converted to one that appears to be on chrY:

Y       21154172      T       A       X       C       .   ---cont ---

I am not sure why these lines are there, but I think it is better to filter them out with a grep in the parallel step, this way:

# parallel implementation of the hg19 format and sort
parallel -j 4 \
"zgrep -v '#chr' dbNSFP${version}_variant.chr{}.gz | \
awk '${awkBody1}' | \
awk '${awkBody2}' |  grep '^{}' | \
LC_ALL=C sort --parallel=2 -n -S 1G -T . -k 1,1 -k 2,2 --compress-program=gzip | \
bgzip -@ 2 > dbNSFPv${version}.chr{}.hg19.custombuild.gz" :::: chromosomeList.txt
#TODO: need to implement a defined approach to GNU parallel, can't use $THREADS 

By the way, I also think that there is a mistake in this line (it is not important because the $8 column is not used later):

awkBody2='BEGIN{FS=OFS="\t"} {t = $2; $2 = $9; $9 = t; x = $1; $1 = $8; $8 = $1; print;}'

that should be:

awkBody2='BEGIN{FS=OFS="\t"} {t = $2; $2 = $9; $9 = t; x = $1; $1 = $8; $8 = x; print;}'
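The difference is easy to see on a single made-up row, where columns 1/2 hold the hg38 chr/pos and columns 8/9 the hg19 chr/pos (this is illustrative data, not a real dbNSFP record):

```shell
# One fabricated dbNSFP-style row (only the first nine columns shown).
row() { printf '6\t106971480\tT\tA\tX\tC\t.\tY\t21154172\n'; }

# Buggy swap: $8 is assigned from the already-overwritten $1, so the
# original hg38 chromosome (6) is lost.
row | awk 'BEGIN{FS=OFS="\t"} {t=$2; $2=$9; $9=t; x=$1; $1=$8; $8=$1; print}' > buggy.txt

# Fixed swap: the saved value x lands in $8 as intended.
row | awk 'BEGIN{FS=OFS="\t"} {t=$2; $2=$9; $9=t; x=$1; $1=$8; $8=x; print}' > fixed.txt
cat buggy.txt fixed.txt
```

In the buggy output column 8 ends up duplicating the new chromosome (Y), while the fixed version preserves the old hg38 chromosome (6) there.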

@sirselim (Author)

@dariogf

The problem that @robinpalvadeau is having is because there are some "bad" lines in chr6, chr17 and chr22 that, when filtered by awk to convert to hg19 format, seem to belong to another chromosome. For example, this line from the chr6 file:

I mentioned these variants in my response above; they are not 'bad' variants, and I think it is dangerous to assume as much, so I don't remove them. That's why the creation of dbNSFP hg19 is a little trickier: you need to identify the variants that map to different chromosomes between the builds, move them to the correct location, and then apply the sorting. I have this all written up, but it unfortunately takes what I consider to be too much RAM to do efficiently. So my current solution is to build the most recent version of dbNSFP once (for both hg19 and hg38) on our cluster and then distribute it across other machines/locations as required.

What I'm doing in the snippets you have highlighted is moving the columns into the correct location (the awk script you mention). By default dbNSFP uses hg38 positions in columns 1 and 2, and then has hg19 positions in columns 8 and 9. If you haven't (and are interested) I suggest you familiarise yourself with the dbNSFP structure (most recent readme: https://drive.google.com/file/d/1HPcIEE1USiwCPzRFFwZhTZagka1uyUV1/view):

...
Columns of dbNSFP_variant:
1	chr: chromosome number
2	pos(1-based): physical position on the chromosome as to hg38 (1-based coordinate).
		For mitochondrial SNV, this position refers to the rCRS (GenBank: NC_012920). 
3	ref: reference nucleotide allele (as on the + strand)
4	alt: alternative nucleotide allele (as on the + strand)
5	aaref: reference amino acid
		"." if the variant is a splicing site SNP (2bp on each end of an intron)
6	aaalt: alternative amino acid
		"." if the variant is a splicing site SNP (2bp on each end of an intron)
7	rs_dbSNP151: rs number from dbSNP 151
8	hg19_chr: chromosome as to hg19, "." means missing
9	hg19_pos(1-based): physical position on the chromosome as to hg19 (1-based coordinate).
		For mitochondrial SNV, this position refers to a YRI sequence (GenBank: AF347015)
10	hg18_chr: chromosome as to hg18, "." means missing
11	hg18_pos(1-based): physical position on the chromosome as to hg18 (1-based coordinate)
		For mitochondrial SNV, this position refers to a YRI sequence (GenBank: AF347015)
...
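As a quick illustration of that layout, the hg19 coordinates can be pulled from columns 8 and 9 while dropping rows with no hg19 mapping (the "." placeholder). Toy rows below, not real dbNSFP data:

```shell
# Two fabricated rows: one with hg19 coordinates, one without ("." in col 8).
printf '1\t12345\tA\tG\t.\t.\trs1\t1\t11111\n1\t99999\tC\tT\t.\t.\trs2\t.\t.\n' > variants.tsv

# Keep only rows that have an hg19 chromosome; emit hg19 chr, pos, ref, alt.
awk 'BEGIN{FS=OFS="\t"} $8 != "." { print $8, $9, $3, $4 }' variants.tsv > hg19_coords.tsv
cat hg19_coords.tsv
```

The `$8 != "."` condition is the same filter the script's `awkBody1` applies before swapping the coordinate columns.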

I need to revisit this little side project at some stage, but as it's currently implemented it is working for our purposes.
