Skip to content

Instantly share code, notes, and snippets.

@explodecomputer
Last active August 15, 2025 13:49
Show Gist options
  • Save explodecomputer/6c4f13740f3d1a6db61f7ca56d82f4db to your computer and use it in GitHub Desktop.
Save explodecomputer/6c4f13740f3d1a6db61f7ca56d82f4db to your computer and use it in GitHub Desktop.
Quickly obtain rsids from dbsnp for all variants in VCF file

RSID lookup with tabix

Strategy:

  1. Extract the chr/pos/rsid from dbsnp into a .bed file. Here it makes a separate .bed file per chromosome for parallelisation of lookups
  2. Create a tabix index for the .bed files
  3. Extract the chr/pos from the target VCF file
  4. Use tabix to query the target chr/pos against the tabix indexed .bed files. Parellelised across chromosomes using GNU parallel.
  5. Update the VCF file with the extracted RSIDs

Setup

mamba create -n bcftools -c bioconda -c conda-forge bcftools tabix parallel
mamba activate bcftools
#!/bin/bash
# Create tabix index of rsids from dbsnp vcf file
dbsnp_vcf="/mnt/storage/home/gh13047/mr-eve/vcf-reference-datasets/dbsnp/dbsnp.v153.b37.vcf.gz"
bcftools index -s /mnt/storage/home/gh13047/mr-eve/vcf-reference-datasets/dbsnp/dbsnp.v153.b37.vcf.gz | cut -f 1 > chroms.txt
dbsnp_chrom () {
chrom=$1
time bcftools query \
-e 'ID == "."' \
-r $chrom \
-f '%CHROM\t%POS\t%POS\t%ID\n' /mnt/storage/home/gh13047/mr-eve/vcf-reference-datasets/dbsnp/dbsnp.v153.b37.vcf.gz | bgzip > dbsnp.v153.${chrom}.bed.gz
tabix -f -p bed dbsnp.v153.${chrom}.bed.gz
}
export -f dbsnp_chrom
parallel -j 10 --progress --eta --bar -a chroms.txt "dbsnp_chrom {}"
#!/bin/bash
# Lookup rsids from the chr:pos in a GWAS file
get_rsids () {
fn=$1
threads=$2
output=$3
time bcftools query \
-e 'ID == "."' \
-f '%CHROM\t%POS\n' ${fn} | awk '{print $1"\t"($2-1)"\t"($2+1)}' > variants.bed
# convert to using parallel
ls -S dbsnp.v153.*bed.gz | cut -d "." -f 3 > chroms.txt
time parallel -j 8 --progress --eta --bar -a chroms.txt "tabix dbsnp.v153.{}.bed.gz -R variants.bed > $output{}.txt"
rm variants.bed
cat $output* > ${output}
}
# Example with 9m variants
fn="/mnt/storage/private/mrcieu/research/scratch/IGD/data/public/ieu-a-7/ieu-a-7.vcf.gz"
get_rsids $fn 8 "lookup_parallel_7.txt"
# real 5m54.547s
# user 38m34.893s
# sys 0m14.216s
# Update the vcf file
# Annotate the VCF with rsids
annotate_vcf () {
fn=$1
mapping=$2
output=$3
time bcftools annotate \
-a ${mapping} \
-c CHROM,FROM,TO,ID \
${fn} | bgzip -c > ${output}
}
time annotate_vcf $fn "lookup_parallel_7.txt" "ieu-a-7.annotated.vcf.gz"
# real 0m38.778s
# user 0m41.300s
# sys 0m1.151s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment