Variant Effect Predictor Other information
Getting VEP to run faster
Set up correctly, the VEP script is capable of processing around 3 million variants in 1 hour. There are a number of steps you can take to make sure your VEP installation is running as fast as possible:
Make sure you have the latest version of the VEP and Ensembl API. We regularly introduce optimisations, alongside the new features and bug fixes of a typical new release.
Download a cache file for your species. If you are using --database, you should consider using --cache or --offline instead. Any time the VEP has to access data from the database (even if you have a local copy), it will be slower than accessing data in the cache on your local file system.
Enabling certain flags forces the VEP to access the database. The script warns you at startup when this will happen, e.g.:
2011-06-16 16:24:51 - INFO: Database will be accessed when using --check_svs
Consider carefully whether you need to use these flags in your analysis.
Using forking enables the VEP to run multiple parallel "threads", with each thread processing a subset of your input. Most modern computers have more than one processor core, so running the VEP with forking enabled can give huge speed increases (3-4x faster in most cases). Even computers with a single core will see speed benefits due to overheads associated with using object-oriented code in Perl.
To use forking, you must choose a number of forks to use with the --fork flag. Most users should use 4 forks:
perl variant_effect_predictor.pl -i my_input.vcf --fork 4 --offline
but depending on various factors specific to your setup you may see faster performance with fewer or more forks.
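Because the best fork count depends on your setup, one way to choose it is to time a small, representative input at several values. The following is a minimal sketch, not part of the VEP distribution: benchmark_forks and the VEP_CMD variable are hypothetical names, timing is in whole seconds, and it assumes the -o and --force_overwrite flags are available in your VEP version.

```shell
# Time one VEP run per fork count and print the elapsed seconds.
# Hypothetical helper: VEP_CMD may be overridden, e.g. for a dry run.
benchmark_forks() {
    input=$1
    shift
    vep=${VEP_CMD:-perl variant_effect_predictor.pl}
    for n in "$@"; do
        start=$(date +%s)
        # $vep is left unquoted on purpose so "perl script.pl" splits into two words
        $vep -i "$input" --fork "$n" --offline \
            -o "vep_fork_${n}.txt" --force_overwrite >/dev/null
        end=$(date +%s)
        echo "forks=${n} seconds=$((end - start))"
    done
}

# Usage (sample.vcf is a placeholder for a subset of your own input):
#   benchmark_forks sample.vcf 1 2 4 8
```

Each run writes to its own output file, so the results can also be compared to confirm that the fork count changes only the runtime.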
VEP users writing plugins should be aware that while the VEP code attempts to preserve the state of any plugin-specific cached data between separate forks, there may be situations where data is lost. If you find this is the case, you should disable forking in the new() method of your plugin by deleting the "fork" key from the $config hash.
If you use --check_existing or any flags that invoke it (e.g. --gmaf, --maf_1kg, --filter_common, --everything), tabix-convert your cache file. Checking for known variants using a converted cache is >100% faster than using the default format.
Make sure your cache and FASTA files are stored on the fastest file system or disk you have available. If you have a lot of memory in your machine, you can even pre-copy the files to memory using tmpfs.
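As an illustration of pre-copying to memory, the sketch below copies a cache directory to a second location and prints that location. It assumes that your cache lives in the default $HOME/.vep directory, that /dev/shm is a tmpfs mount on your system (common on Linux, but not guaranteed), and that the --dir flag is used to point the VEP at the copy. stage_cache is a hypothetical helper name.

```shell
# Copy a cache directory (including hidden files) to a new location
# and print the destination path on success.
stage_cache() {
    src=$1
    dest=$2
    mkdir -p "$dest" && cp -R "$src"/. "$dest"/ && printf '%s\n' "$dest"
}

# Usage (paths are assumptions about your system):
#   fast_dir=$(stage_cache "$HOME/.vep" /dev/shm/vep_cache)
#   perl variant_effect_predictor.pl -i my_input.vcf --offline --dir "$fast_dir"
```

Files on tmpfs count against system memory and are lost on reboot, so check that the cache fits comfortably before copying it.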
Install the Ensembl::XS package. This contains compiled versions of certain key subroutines used in the VEP that will run faster than the default native Perl equivalents. Using this should improve runtime by 5-10%.
The VEP is optimised to run on input files that are sorted in chromosomal order. Unsorted files will still work, albeit more slowly.
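If your input is an uncompressed VCF, it can be sorted with standard tools before running the VEP. The sketch below is one way to do this; sort_vcf is a hypothetical helper name. It sorts chromosome names as text (so 10 comes before 2), on the assumption that what matters is grouping variants by chromosome and ordering them by position within each chromosome.

```shell
# Print a VCF with its header lines first, then the records sorted by
# chromosome name (column 1, as text) and position (column 2, numeric).
sort_vcf() {
    tab=$(printf '\t')
    grep '^#' "$1"
    grep -v '^#' "$1" | LC_ALL=C sort -t "$tab" -k1,1 -k2,2n
}

# Usage:
#   sort_vcf my_input.vcf > my_input.sorted.vcf
```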
For very large files (for example those from whole-genome sequencing), the VEP process can be easily parallelised by dividing your file into chunks (e.g. by chromosome). The VEP will also work with tabix-indexed, bgzipped VCF files, and so the tabix utility could be used to divide the input file:
tabix -h variants.vcf.gz 12:1000000-20000000 | perl variant_effect_predictor.pl --cache --vcf
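To extend this to a whole file, the same command can be generated once per chromosome. The sketch below only prints the commands, so they can be inspected before being run; vep_chunk_commands and the vep_<chromosome>.vcf output names are hypothetical, and it assumes the -o flag is used to name each output file and that file names contain no spaces.

```shell
# Print one "tabix | VEP" command line per chromosome name given.
vep_chunk_commands() {
    vcf=$1
    shift
    for chrom in "$@"; do
        printf 'tabix -h %s %s | perl variant_effect_predictor.pl --cache --vcf -o vep_%s.vcf\n' \
            "$vcf" "$chrom" "$chrom"
    done
}

# Usage: review the commands, then pipe them to a shell or a job scheduler.
#   vep_chunk_commands variants.vcf.gz 1 2 3 X
#   vep_chunk_commands variants.vcf.gz 1 2 3 X | sh
```

Piping to sh runs the chunks one after another; submitting each line as a separate job is what provides the parallelism.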
Species with multiple assemblies
With the arrival of GRCh38, Ensembl now supports two different assembly versions for the human genome while users transition from GRCh37. We provide a VEP cache download on the latest software version (79) for both assembly versions.
The VEP installer will install and set up the correct cache and FASTA file for your assembly of interest. If using the --AUTO functionality to install without prompts, remember to add the assembly version required using e.g. "--ASSEMBLY GRCh37". It is also possible to have concurrent installations of caches from both assemblies; just use the --assembly flag to select the correct one when you run the VEP script.
Once you have installed the relevant cache and FASTA file, you are then able to use the VEP as normal. For those using GRCh37 and requiring database access in addition to the cache (for example, to look up variant identifiers using --format id, see cache limitations), the script will warn you that you must change the database port in order to connect to the correct database:
ERROR: Cache assembly version (GRCh37) and database or selected assembly version (GRCh38) do not match
If using human GRCh37 add "--port 3337" to use the GRCh37 database, or --offline to avoid database connection entirely
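When both assemblies are installed, it can help to keep the matching flags together so that the port is never forgotten. The sketch below encodes only what is described above: GRCh37 needs "--port 3337" for database access, while GRCh38 uses the default port. vep_assembly_flags is a hypothetical helper name.

```shell
# Print the assembly-specific VEP flags for a human assembly name.
vep_assembly_flags() {
    case $1 in
        GRCh37) echo "--assembly GRCh37 --port 3337" ;;
        GRCh38) echo "--assembly GRCh38" ;;
        *) echo "unknown assembly: $1" >&2; return 1 ;;
    esac
}

# Usage (substitution left unquoted so the flags split into separate arguments;
# ids.txt is a placeholder file of variant identifiers):
#   perl variant_effect_predictor.pl -i ids.txt --format id --cache $(vep_assembly_flags GRCh37)
```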
For users looking to move their data between assemblies, Ensembl provides an assembly converter tool - if you've downloaded the VEP, then you have it already! The script is found in the ensembl-tools/scripts/assembly_converter folder. There is also an online version of the tool available. Both UCSC (liftOver) and NCBI (Remap) also provide tools for converting data between assemblies.
By default the VEP is configured to provide annotation on every genomic feature that each input variant overlaps. This means that if a variant overlaps a gene with multiple alternate splicing variants (transcripts), then a block of annotation for each of these transcripts is reported in the output. In the default VEP output format each of these blocks is written on a single line of output; in VCF output format the blocks are separated by commas in the INFO field.
For many users, however, this depth of annotation is not required, and to this end the VEP provides a number of options to reduce the amount of output produced. Which to choose depends on your motivations and requirements on the output.
NB: Wherever possible we would discourage users from summarising data in this way. Summarising inevitably involves data loss, and invariably at some point this will lead to the loss of biologically relevant information. For example, if your variant overlaps both a regulatory feature and a transcript and you use one of the flags below, the overlap with the regulatory feature will be lost in your output, when in some cases this may be a clue to the "real" functional effect of your variant. For these reasons we encourage users to use one of the flagging options (--flag_pick or --flag_pick_allele) and to post-filter results.
- --pick: this is the option we anticipate will be of use to most users. The VEP chooses one block of annotation per variant, using an ordered set of criteria. This order may be customised using --pick_order.
- --pick_allele: as above, but chooses one consequence block per variant allele. This can be useful for VCF input files with more than one ALT allele.
- --flag_pick: instead of choosing one block and removing the others, this option adds a flag "PICK=1" to the picked annotation block, allowing users to easily filter on this later using the VEP's filtering script.
- --flag_pick_allele: as above, but flags one block per allele
- --per_gene: as --pick, but chooses one annotation block per gene that the input variant overlaps
- --most_severe: this flag reports only the consequence type of the block with the highest rank, according to this table. Feature-specific annotation is absent from the output using this flag, so use with caution!
- --summary: this flag reports only a comma-separated list of the consequence types predicted for this variant. Feature-specific annotation is absent from the output using this flag, so use with caution!
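As an illustration of post-filtering after --flag_pick, the sketch below keeps only the flagged lines from default-format VEP output. It is not the VEP's own filtering script: keep_picked is a hypothetical helper, and it assumes tab-separated output in which the key=value pairs (including PICK=1) sit in the last column, separated by semicolons.

```shell
# Keep header lines and any line whose last column carries PICK=1.
keep_picked() {
    awk -F '\t' '
        /^#/ { print; next }                     # pass headers through
        $NF ~ /(^|;)PICK=1(;|$)/ { print }       # flagged annotation blocks
    ' "$@"
}

# Usage (file names are placeholders):
#   keep_picked vep_output.txt > vep_output.picked.txt
```

Because the unfiltered file is left untouched, the annotation blocks that were not picked remain available if they are needed later.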
The VEP script supports using HGVS notations as input. This feature is currently under development, and not all HGVS notation types are supported. Specifically, only notations relative to genomic (g) or coding (c) sequences are fully supported; protein (p) notations are supported in a limited fashion because of the complexity involved in determining the multiple possible underlying genomic sequence changes that could produce a single protein change. The script will warn the user if it fails to parse a particular notation.
By default the VEP script uses Ensembl transcripts as its reference for determining consequences, and hence also for HGVS notations. However, it is possible to parse HGVS notations that use RefSeq transcripts as the reference sequence by using the --refseq flag when running the script. Such notations must include the version number of the transcript, e.g. a notation written against NM_080794.3, where ".3" denotes that this is version 3 of the transcript NM_080794. See below for more details on how the VEP can use RefSeq transcripts.
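Since a missing version number is easy to overlook, an input file of RefSeq-based notations can be checked before running the VEP. The sketch below lists notations whose accession is followed directly by a colon, with no version. unversioned_refseq is a hypothetical helper, the accession pattern covers only the common NM_, NR_, XM_ and XR_ prefixes, and the notations in the usage comment are illustrative rather than taken from real data.

```shell
# Print RefSeq-based HGVS notations that lack a transcript version,
# i.e. lines like NM_080794:c.... rather than NM_080794.3:c....
unversioned_refseq() {
    grep -E '^[NX][MR]_[0-9]+:' "$@"
}

# Usage (one notation per line in hgvs.txt, a placeholder file name):
#   unversioned_refseq hgvs.txt
# With input lines "NM_080794.3:c.1001C>T" and "NM_080794:c.1001C>T",
# only the second line is reported.
```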
Ensembl produces Core schema databases containing alignments of RefSeq transcript objects to the reference genome. This is the otherfeatures database, and is produced for human and mouse. The database also contains alignments of CCDS transcripts and Ensembl EST sequences - they may be included in your analysis using --all_refseq. By passing the --refseq flag when running the VEP script, these alternative transcripts will be used as the reference for predicting variant consequences. Gene IDs given in the output when using this option are generally NCBI GeneIDs.
Users should note that RefSeq sequences may disagree with the reference sequence to which they are aligned, hence results generated when using this option should be interpreted with a degree of caution. A much more complex and stringent process is used to produce the main Ensembl Core database, and this should be used in preference to the RefSeq transcripts.
SIFT and PolyPhen predictions and scores are now calculated and referred to internally using the translated sequence, so predictions are available using the --refseq flag where the RefSeq translation matches the Ensembl translation (they will match in the vast majority of cases - most differences between Ensembl and RefSeq transcripts occur in non-coding regions).
The VEP script can be used to convert files between the various formats that it parses. This may be useful for a user with, for example, a number of variants given in HGVS notation against RefSeq transcript identifiers. The conversion process allows these notations to be converted into genomic reference coordinates, and then used to predict consequences in the VEP against Ensembl transcripts.