@ckandoth
Last active July 9, 2018 16:53
Install Ensembl's VEP v84 with various caches for running offline

Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene as detailed here, its CLIA-compliant HGVS variant format, and Sequence Ontology nomenclature for variant effects.

To follow these instructions, we'll assume you have these packaged essentials installed:

sudo yum install -y curl rsync tar make perl perl-core
## OR ##
sudo apt-get install -y curl rsync tar make perl perl-base

VEP requires Perl 5.10 or newer, but I'd strongly recommend at least 5.18. If your system Perl is outdated, then follow this gist to set up Perl 5.22. Once that's done, set temporary shell variables pointing to where Perl and its libraries live. Change this as needed for your system:

export PERL_PATH=/opt/common/CentOS_6-dev/perl/perl-5.22.0
export PERL_BIN=/opt/common/CentOS_6-dev/perl/perl-5.22.0/bin/perl

Handle VEP's Perl dependencies using cpanminus to install them under $PERL_PATH:

curl -L http://cpanmin.us | $PERL_BIN - --notest -l $PERL_PATH LWP LWP::Simple LWP::Protocol::https Archive::Extract Archive::Tar Archive::Zip CGI DBI Time::HiRes DBD::mysql Encode File::Copy::Recursive Perl::OSType Module::Metadata version

Set PERL5LIB to find those libraries. Add this to the end of your ~/.bashrc to make it persistent:

export PERL5LIB=$PERL_PATH/lib/perl5:$PERL_PATH/lib/perl5/x86_64-linux

Create temporary shell variables pointing to where we'll store VEP and its cache data (non-default paths can be used, but then you must specify --vep-path and --vep-data when running vcf2maf or maf2maf):

export VEP_PATH=/opt/common/CentOS_6-dev/vep/v84
export VEP_DATA=/opt/common/CentOS_6-dev/vep/v84

Download the v84 release of VEP:

mkdir -p $VEP_PATH $VEP_DATA; cd $VEP_PATH
curl -LO https://github.com/Ensembl/ensembl-tools/archive/release/84.tar.gz
tar -zxf 84.tar.gz --starting-file variant_effect_predictor --transform='s|.*/|./|g'

Add that path to PERL5LIB, and add the htslib subfolder, where tabix will be installed, to PATH:

export PERL5LIB=$VEP_PATH:$PERL5LIB
export PATH=$VEP_PATH/htslib:$PATH

Download and unpack VEP's offline cache for GRCh37, GRCh38, and GRCm38:

rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-84/variation/VEP/homo_sapiens_vep_84_GRCh{37,38}.tar.gz $VEP_DATA
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-84/variation/VEP/mus_musculus_vep_84_GRCm38.tar.gz $VEP_DATA
cat $VEP_DATA/*_vep_84_GRC{h37,h38,m38}.tar.gz | tar -izxf - -C $VEP_DATA
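A missing cache directory is a common cause of "Cache directory ... not found" errors, so it may be worth checking that the unpacked folders are where VEP will look. The sketch below assumes the tarballs unpack to `<species>/<version>_<assembly>` under `$VEP_DATA`; the `homo_sapiens/84_GRCh37` path appears in the test command later in this gist, while the other two paths are inferred by analogy. `check_cache_dirs` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of VEP): report which of the expected
# subdirectories exist under a data directory.
check_cache_dirs() {
    data_dir="$1"; shift
    missing=0
    for sub in "$@"; do
        if [ -d "$data_dir/$sub" ]; then
            echo "found:   $data_dir/$sub"
        else
            echo "MISSING: $data_dir/$sub" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage, assuming VEP_DATA is set as above (paths partly inferred):
#   check_cache_dirs "$VEP_DATA" homo_sapiens/84_GRCh37 homo_sapiens/84_GRCh38 mus_musculus/84_GRCm38
```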

Install the Ensembl API, the reference FASTAs for GRCh37/GRCh38/GRCm38, and some neat VEP plugins:

$PERL_BIN INSTALL.pl --AUTO afp --SPECIES homo_sapiens --ASSEMBLY GRCh37 --PLUGINS ExAC,UpDownDistance --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
$PERL_BIN INSTALL.pl --AUTO afp --SPECIES homo_sapiens --ASSEMBLY GRCh38 --PLUGINS ExAC,UpDownDistance --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
$PERL_BIN INSTALL.pl --AUTO afp --SPECIES mus_musculus --ASSEMBLY GRCm38 --PLUGINS UpDownDistance --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA

Convert the offline cache for use with tabix, which significantly speeds up the lookup of known variants:

$PERL_BIN convert_cache.pl --species homo_sapiens --version 84_GRCh37,84_GRCh38 --dir $VEP_DATA
$PERL_BIN convert_cache.pl --species mus_musculus --version 84_GRCm38 --dir $VEP_DATA

Download and index a custom ExAC r0.3 VCF that skips variants overlapping known somatic hotspots:

curl -L https://googledrive.com/host/0B6o74flPT8FAYnBJTk9aTF9WVnM > $VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz
tabix -p vcf $VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz
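If tabix complains that the download is "Not a BGZF file", the file on disk may not be what was expected (for example, an error page saved under the `.vcf.gz` name). The sketch below inspects the first bytes of a file for the BGZF signature: the gzip magic bytes with the extra-field flag set, followed by a `BC` subfield at byte offset 12, which is how bgzip normally writes its header. `is_bgzf` is a hypothetical helper name, and this is a heuristic check rather than a full validation.

```shell
# Hypothetical helper (not part of htslib): heuristically test whether a
# file starts with a BGZF block header.
is_bgzf() {
    # Dump the first 16 bytes as one continuous hex string.
    header=$(od -An -tx1 -N16 "$1" | tr -d ' \n')
    case "$header" in
        # 1f 8b 08 04 = gzip magic, deflate, FEXTRA flag set;
        # then 8 bytes we ignore; then 42 43 = "BC" subfield identifier.
        1f8b0804????????????????4243*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage, assuming VEP_DATA is set as above:
#   is_bgzf "$VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz" || echo "not BGZF, inspect the download"
```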

Test running VEP in offline mode with the ExAC plugin, on the provided sample GRCh37 VCF:

$PERL_BIN variant_effect_predictor.pl --species homo_sapiens --assembly GRCh37 --offline --no_progress --everything --shift_hgvs 1 --check_existing --check_alleles --total_length --allele_number --no_escape --xref_refseq --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/84_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --plugin ExAC,$VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz --input_file example_GRCh37.vcf --output_file example_GRCh37.vep.txt
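A quick sanity check after the test run is to count the annotation records in the output. This sketch assumes VEP's default tab-delimited output, in which header and metadata lines start with `#`; `count_vep_records` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of VEP): count the lines in an output file
# that are not '#'-prefixed header lines.
count_vep_records() {
    grep -vc '^#' "$1"
}

# Example usage, assuming the test command above wrote this file:
#   count_vep_records example_GRCh37.vep.txt
```

A count of zero, or a missing file, suggests that the run did not complete.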
@Pranav-arraygen

Hello,
I am having problems creating directories, even after granting write permission.

The following error pops up:

./vep --cache -i variation.vcf -o output.txt

-------------------- EXCEPTION --------------------
MSG: ERROR: Cache directory /home/arraygen/.vep/homo_sapiens not found

STACK Bio::EnsEMBL::VEP::CacheDir::dir /home/arraygen/ensembl-vep/modules/Bio/EnsEMBL/VEP/CacheDir.pm:282
STACK Bio::EnsEMBL::VEP::CacheDir::init /home/arraygen/ensembl-vep/modules/Bio/EnsEMBL/VEP/CacheDir.pm:206
STACK Bio::EnsEMBL::VEP::CacheDir::new /home/arraygen/ensembl-vep/modules/Bio/EnsEMBL/VEP/CacheDir.pm:110
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all_from_cache /home/arraygen/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:115
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all /home/arraygen/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:91
STACK Bio::EnsEMBL::VEP::BaseRunner::get_all_AnnotationSources /home/arraygen/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:175
STACK Bio::EnsEMBL::VEP::Runner::init /home/arraygen/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:122
STACK Bio::EnsEMBL::VEP::Runner::run /home/arraygen/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:181
STACK toplevel ./vep:211
Date (localtime) = Fri Jun 23 10:09:20 2017
Ensembl API version = 89
Please let me know the solution.

@casina
casina commented Dec 12, 2017

Hi, I am new to all of this, but I have been trying to get this working for months now!

Everything appears to install fine. But when I run the final step to test VEP in offline mode, I get:

2017-12-13 02:48:17 - Read existing cache info

BEGIN failed--compilation aborted at (eval 22) line 2.

2017-12-13 02:53:28 - Starting...
2017-12-13 02:53:28 - Detected format of input file as vcf
2017-12-13 02:53:28 - Read 173 variants into buffer
2017-12-13 02:53:28 - Checking for existing variations
2017-12-13 02:53:28 - Reading transcript data from cache and/or database
2017-12-13 02:53:29 - Retrieved 3097 transcripts (0 mem, 3162 cached, 0 DB, 65 duplicates)
2017-12-13 02:53:29 - Reading regulatory data from cache and/or database
2017-12-13 02:53:29 - Retrieved 14876 regulatory features (0 mem, 14877 cached, 0 DB, 1 duplicates)
2017-12-13 02:53:29 - Analyzing chromosome 21
2017-12-13 02:53:29 - Analyzing variants
2017-12-13 02:53:29 - Analyzing MotifFeatures
2017-12-13 02:53:29 - Analyzing RegulatoryFeatures
2017-12-13 02:53:29 - Calculating consequences
Segmentation fault (core dumped)

When I go to the .vep folder, I see that the fa.gz file is there:
-rw-rw-r-- 1 3008M Mar 23 2017 Homo_sapiens.GRCh37.75.dna.primary_assembly.fa
-rwxrwxr-x 1 830M Mar 24 2017 Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
-rw-rw-r-- 1 1M Feb 9 2017 Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz.fai
-rw-r--r-- 1 1M Mar 23 2017 Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.index

Any help would be much appreciated!

@dullahan8

I've run into a couple of issues:

  1. $PERL_BIN convert_cache.pl --species mus_musculus --version 84_GRCm38 --dir $VEP_DATA
    2018-07-09 09:27:32 - Processing mus_musculus
    2018-07-09 09:27:32 - Processing version 84_GRCm38
    Can't use an undefined value as an ARRAY reference at convert_cache.pl line 188.
  2. tabix -p vcf $VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz
    Not a BGZF file: /opt/vcf2maf/vep/ExAC.r0.3.sites.minus_somatic.vcf.gz
    tbx_index_build failed: /opt/vcf2maf/vep/ExAC.r0.3.sites.minus_somatic.vcf.gz

Any help would be appreciated. Thanks.
