Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene, its CLIA-compliant HGVS variant format, and its use of Sequence Ontology nomenclature for variant effects.
The official instructions to install VEP have never worked well from the United States because of the flaky network connection to their FTP servers in the UK. So, we will instead use conda to install VEP and its dependencies and then manually download VEP caches and reference genomes using rsync.
If you don't already have conda, download and install it into $HOME/miniconda3:
curl -sL https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o miniconda.sh
bash miniconda.sh -bup $HOME/miniconda3 && rm -f miniconda.sh
Add the following lines to your ~/.bashrc, and then log out and log back in to add conda to your $PATH:
# Add conda to PATH if found
if [ -f "$HOME/miniconda3/etc/profile.d/conda.sh" ]; then
    . "$HOME/miniconda3/etc/profile.d/conda.sh"
fi
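If you set up several machines, the edit above can be scripted so that it is safe to run more than once. The following is a minimal sketch, not part of the official conda setup: `add_conda_hook` is a hypothetical helper name, and it assumes the comment line `# Add conda to PATH if found` is a good enough marker to detect an earlier run.

```shell
# Hypothetical helper: append the conda hook to an rc file only if it is not there yet.
add_conda_hook() {
    rc_file="$1"
    marker='# Add conda to PATH if found'
    # Skip the append when the marker comment is already present.
    if grep -qF "$marker" "$rc_file" 2>/dev/null; then
        echo "conda hook already present in $rc_file"
        return 0
    fi
    # Quoted heredoc delimiter keeps $HOME unexpanded in the rc file.
    cat >> "$rc_file" <<'EOF'

# Add conda to PATH if found
if [ -f "$HOME/miniconda3/etc/profile.d/conda.sh" ]; then
    . "$HOME/miniconda3/etc/profile.d/conda.sh"
fi
EOF
    echo "conda hook appended to $rc_file"
}

# Usage (uncomment to run): add_conda_hook "$HOME/.bashrc"
```

Running it a second time leaves the file unchanged, which avoids stacking duplicate hooks in ~/.bashrc.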
Update conda to the latest version and configure it to use libmamba, a faster dependency solver:
conda update -y -n base -c defaults conda
conda config --set solver libmamba
Create and activate a conda environment with VEP, its dependencies, and other related tools:
conda create -y -n vep && conda activate vep
conda install -y -c conda-forge -c bioconda -c defaults ensembl-vep==112.0 htslib==1.20 bcftools==1.20 samtools==1.20 ucsc-liftover==447
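Before downloading tens of gigabytes of cache, it can be worth confirming that the environment actually exposes the expected executables. The sketch below uses only `command -v`; `check_tools` is a hypothetical helper name, and the tool list in the usage line is an assumption about which binaries these packages put on $PATH (`bgzip` from htslib, `liftOver` from ucsc-liftover).

```shell
# Hypothetical helper: report which of the named executables are on $PATH.
# Returns 0 if all are found, 1 if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'ok       %s -> %s\n' "$tool" "$(command -v "$tool")"
        else
            printf 'MISSING  %s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, inside the activated "vep" environment (uncomment to run):
# check_tools vep bgzip bcftools samtools liftOver
```

A `MISSING` line usually means the environment is not activated or the install step did not finish.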
Download VEP's offline cache for GRCh38, and the reference FASTA:
mkdir -p $HOME/.vep/homo_sapiens/112_GRCh38/
rsync -avr --progress rsync://ftp.ensembl.org/ensembl/pub/release-112/variation/indexed_vep_cache/homo_sapiens_vep_112_GRCh38.tar.gz $HOME/.vep/
tar -zxf $HOME/.vep/homo_sapiens_vep_112_GRCh38.tar.gz -C $HOME/.vep/
rsync -avr --progress rsync://ftp.ensembl.org/ensembl/pub/release-112/fasta/homo_sapiens/dna_index/ $HOME/.vep/homo_sapiens/112_GRCh38/
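Interrupted transfers are easy to miss, so a quick check of the resulting layout can save a confusing VEP error later. This is a sketch under the assumption that the directory and FASTA names follow the pattern used in this guide (`homo_sapiens/<release>_<assembly>/Homo_sapiens.<assembly>.dna.toplevel.fa.gz` plus its `.fai` and `.gzi` indexes); `check_vep_data` is a hypothetical helper name.

```shell
# Hypothetical helper: confirm the cache directory and the indexed FASTA exist.
# Arguments: VEP data dir, release number, assembly name.
check_vep_data() {
    vep_dir="$1"
    release="$2"
    assembly="$3"
    data_dir="$vep_dir/homo_sapiens/${release}_${assembly}"
    fasta="$data_dir/Homo_sapiens.${assembly}.dna.toplevel.fa.gz"
    status=0
    if [ ! -d "$data_dir" ]; then
        echo "missing cache directory: $data_dir"
        status=1
    fi
    # The bgzipped FASTA needs both a .fai (faidx) and a .gzi (bgzip) index.
    for f in "$fasta" "$fasta.fai" "$fasta.gzi"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            status=1
        fi
    done
    if [ "$status" -eq 0 ]; then
        echo "cache and FASTA look complete in $data_dir"
    fi
    return "$status"
}

# Usage (uncomment to run): check_vep_data "$HOME/.vep" 112 GRCh38
```

The check only looks at file presence and size; it does not validate the contents of the cache.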
(Optional) Download VEP's offline cache for GRCh37, and the reference FASTA, which is distributed as plain gzip and must be recompressed with bgzip so that it can be indexed:
mkdir -p $HOME/.vep/homo_sapiens/112_GRCh37/
rsync -avr --progress rsync://ftp.ensembl.org/ensembl/pub/release-112/variation/indexed_vep_cache/homo_sapiens_vep_112_GRCh37.tar.gz $HOME/.vep/
tar -zxf $HOME/.vep/homo_sapiens_vep_112_GRCh37.tar.gz -C $HOME/.vep/
rsync -avr --progress rsync://ftp.ensembl.org/ensembl/pub/grch37/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz $HOME/.vep/homo_sapiens/112_GRCh37/
gzip -d $HOME/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz
bgzip -i $HOME/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa
samtools faidx $HOME/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz
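Plain gzip and bgzip files share the `.gz` extension, so it is easy to lose track of which one is on disk. The sketch below tells them apart by inspecting the first header bytes: a BGZF block, as written by bgzip, starts with `1f 8b 08 04` and carries the subfield identifier `BC` (hex `42 43`) at byte offsets 12 and 13. `is_bgzf` is a hypothetical helper name, and the check only examines the first block.

```shell
# Hypothetical helper: succeed if the file starts with a BGZF block header.
is_bgzf() {
    # Read the first 14 bytes as a lowercase hex string without whitespace.
    hex=$(od -An -tx1 -N14 "$1" | tr -d '[:space:]')
    case "$hex" in
        # gzip magic + deflate + FEXTRA flag, 8 bytes skipped, then 'B' 'C'.
        1f8b0804????????????????4243*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (uncomment to run):
# is_bgzf "$HOME/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz" \
#     && echo "BGZF" || echo "not BGZF, recompress with bgzip"
```

If the file turns out to be plain gzip, repeating the `gzip -d` and `bgzip -i` steps above converts it.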
Test running VEP in offline mode on a GRCh38 VCF:
curl -sLO https://raw.githubusercontent.com/Ensembl/ensembl-vep/release/112/examples/homo_sapiens_GRCh38.vcf
vep --species homo_sapiens --assembly GRCh38 --offline --no_progress --no_stats \
    --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype \
    --canonical --protein --biotype --tsl --pubmed --variant_class --shift_hgvs 1 \
    --check_existing --total_length --allele_number --no_escape --xref_refseq \
    --failed 1 --vcf --minimal --flag_pick_allele \
    --pick_order canonical,tsl,biotype,rank,ccds,length \
    --dir $HOME/.vep \
    --fasta $HOME/.vep/homo_sapiens/112_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
    --input_file homo_sapiens_GRCh38.vcf \
    --output_file homo_sapiens_GRCh38.vep.vcf \
    --polyphen b --af --af_1kg --regulatory
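With `--vcf`, VEP writes its annotations into a pipe-delimited `CSQ` INFO field, and the order of the sub-fields is declared in the VCF header after the text `Format: `. To see which columns a given set of flags produced, the header line can be split into a numbered list. This is a small sketch using awk; `csq_fields` is a hypothetical helper name.

```shell
# Hypothetical helper: print the numbered CSQ sub-field names from a VEP VCF header.
csq_fields() {
    awk '/^##INFO=<ID=CSQ,/ {
             sub(/.*Format: /, "")      # drop everything up to the field list
             sub(/">.*/, "")            # drop the closing quote and bracket
             n = split($0, f, "|")
             for (i = 1; i <= n; i++) print i "\t" f[i]
             exit
         }' "$1"
}

# Usage (uncomment to run): csq_fields homo_sapiens_GRCh38.vep.vcf
```

Because the command above uses `--flag_pick_allele`, the list should include a `PICK` field, which marks the one annotation chosen per allele according to `--pick_order`.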