EnsemblEnsembl Home

Variant Effect Predictor Other information


Getting VEP to run faster

Set up correctly, VEP is capable of processing around 3 million variants in 30 minutes. There are a number of steps you can take to make sure your VEP installation is running as fast as possible:

  1. Make sure you have the latest version of VEP and the Ensembl API. We regularly introduce optimisations, alongside the new features and bug fixes of a typical new release.

  2. Download a cache file for your species. If you are using --database, you should consider using --cache or --offline instead. Any time VEP has to access data from the database (even if you have a local copy), it will be slower than accessing data in the cache on your local file system.

    Enabling certain flags forces VEP to access the database, and the script will warn you at startup that it will do this with e.g.:

    2011-06-16 16:24:51 - INFO: Database will be accessed when using --check_svs

    Consider carefully whether you need to use these flags in your analysis.

  3. If you use --check_existing or any flags that invoke it (e.g. --af, --af_1kg, --filter_common, --everything), tabix-convert your cache file. Checking for known variants using a converted cache is >100% faster than using the default format.

  4. Download a FASTA file if you use --hgvs or --check_ref. Again, this will prevent VEP accessing the database unnecessarily (in this case to retrieve genomic sequence).

  5. Using forking enables VEP to run multiple parallel "threads", with each thread processing a subset of your input. Most modern computers have more than one processor core, so running VEP with forking enabled can give huge speed increases (3-4x faster in most cases). Even computers with a single core will see speed benefits due to overheads associated with using object-oriented code in Perl.

    To use forking, you must choose a number of forks to use with the --fork flag. Most users should use 4 forks:

    ./vep -i my_input.vcf -fork 4 -offline

    but depending on various factors specific to your setup you may see faster performance with fewer or more forks.

    VEP users writing plugins should be aware that while the VEP code attempts to preserve the state of any plugin-specific cached data between separate forks, there may be situations where data is lost. If you find this is the case, you should disable forking in the new() method of your plugin by deleting the "fork" key from the $config hash.

  6. Make sure your cache and FASTA files are stored on the fastest file system or disk you have available. If you have a lot of memory in your machine, you can even pre-copy the files to memory using tmpfs.

  7. Consider if you need to generate HGVS notations (--hgvs); this is a complex annotation step that can add ~50-80% to your runtime. Note also that --hgvs is switched on by --everything.

  8. Install the Set::Interval tree perl package. This package speeds up VEP's internals by changing how overlaps between variants and transcript components are calculated.

  9. Install the Ensembl::XS package. This contains compiled versions of certain key subroutines used in VEP that will run faster than the default native Perl equivalents. Using this should improve runtime by 5-10%.

  10. Add the --no_stats flag. Calculating statistics adds some runtime to VEP and most users will not need them.

  11. VEP is optimised to run on input files that are sorted in chromosomal order. Unsorted files will still work, albeit more slowly.

  12. For very large files (for example those from whole-genome sequencing), VEP process can be easily parallelised by dividing your file into chunks (e.g. by chromosome). VEP will also work with tabix-indexed, bgzipped VCF files, and so the tabix utility could be used to divide the input file:

     tabix -h variants.vcf.gz 12:1000000-20000000 | ./vep -cache -vcf 

Species with multiple assemblies

With the arrival of GRCh38, Ensembl now supports two different assembly versions for the human genome while users transition from GRCh37. We provide a VEP cache download on the latest software version (90) for both assembly versions.

The VEP installer will install and set up the correct cache and FASTA file for your assembly of interest. If using the --AUTO functionality to install without prompts, remember to add the assembly version required using e.g. "--ASSEMBLY GRCh37". It is also possible to have concurrent installations of caches from both assemblies; just use the --assembly to select the correct one when you run VEP.

Once you have installed the relevant cache and FASTA file, you are then able to use VEP as normal. For those using GRCh37 and requiring database access in addition to the cache (for example, to look up variant identifiers using --format id, see cache limitations), the script will warn you that you must change the database port in order to connect to the correct database:

ERROR: Cache assembly version (GRCh37) and database or selected assembly version (GRCh38) do not match

If using human GRCh37 add "--port 3337" to use the GRCh37 database, or --offline to avoid database connection entirely

For users looking to move their data between assemblies, Ensembl provides an assembly converter tool - if you've downloaded VEP, then you have it already! The script is found in the ensembl-tools/scripts/assembly_converter folder. There is also an online version of the tool available. Both UCSC (liftOver) and NCBI (Remap) also provide tools for converting data between assemblies.


Summarising annotation

By default VEP is configured to provide annotation on every genomic feature that each input variant overlaps. This means that if a variant overlaps a gene with multiple alternate splicing variants (transcripts), then a block of annotation for each of these transcripts is reported in the output. In the default VEP output format each of these blocks is written on a single line of output; in VCF output format the blocks are separated by commas in the INFO field.

For many users, however, this depth of annotation is not required, and to this end VEP provides a number of options to reduce the amount of output produced. Which to choose depends on your motivations and requirements on the output.

NB: Wherever possible we would discourage users from summarising data in this way. Summarising inevitably involves data loss, and invariably at some point this will lead to the loss of biologically relevant information. For example, if your variant overlaps both a regulatory feature and a transcript and you use one of the flags below, the overlap with the regulatory feature will be lost in your output, when in some cases this may be a clue to the "real" functional effect of your variant. For these reasons we encourage users to use one of the flagging options (--flag_pick, --flag_pick_allele or --flag_pick_allele_gene) and to post-filter results.

  • --pick: this is the option we anticipate will be of use to most users. VEP chooses one block of annotation per variant, using an ordered set of criteria. This order may be customised using --pick_order.
    1. canonical status of transcript
    2. APPRIS isoform annotation
    3. transcript support level
    4. biotype of transcript (protein_coding preferred)
    5. CCDS status of transcript
    6. consequence rank according to this table
    7. translated, transcript or feature length (longer preferred)
  • Note that some categories may not be available for the species or cache version that you are using; in these cases the category will be skipped and the next in line used.

  • --pick_allele: as above, but chooses one consequence block per variant allele. This can be useful for VCF input files with more than one ALT allele
  • --per_gene: as --pick, but chooses one annotation block per gene that the input variant overlaps
  • --pick_allele_gene: as above, but chooses one consequence block per variant allele and gene combination.
  • --flag_pick: instead of choosing one block and removing the others, this option adds a flag "PICK=1" to picked annotation block, allowing users to easily filter on this later using VEP's filtering script
  • --flag_pick_allele: as above, but flags one block per allele
  • --flag_pick_allele_gene: as above, but flags one block per allele and gene combination
  • --most_severe: this flag reports only the consequence type of the block with the highest rank, according to this table. Feature-specific annotation is absent from the output using this flag, so use with caution!
  • --summary: this flag reports only a comma-separated list of the consequence types predicted for this variant. Feature-specific annotation is absent from the output using this flag, so use with caution!

HGVS notations

Output

HGVS notations can be produced by VEP using the --hgvs flag. Coding (c.) and protein (p.) notations given against Ensembl identifiers use versioned identifiers that guarantee the identifier refers always to the same sequence.

Genomic HGVS notations may be reported using --hgvsg. Note that the named reference for HGVSg notations will be the chromosome name from the user input (as opposed to the officially recommended chromosome accession).

HGVS notations for insertions or deletions are by default shifted 3-prime relative to the reported transcript or protein sequence in accordance with HGVS specifications. This may lead to discrepancies between the coordinates reported in the HGVS nomenclature and the coordinate columns reported by VEP. You may instruct VEP not to shift using --shift_hgvs 0.

Input

VEP supports using HGVS notations as input. This feature is currently under development, and not all HGVS notation types are supported. Notations relative to genomic (g.) or coding (c.) sequences are currently fully supported; protein (p.) notations are supported in limited fashion due to the complexity involved in determining the multiple possible underlying genomic sequence changes that could produce a single protein change. The script will warn the user if it fails to parse a particular notation.

By default VEP uses Ensembl transcripts as its reference for determining consequences, and hence also for HGVS notations. However, it is possible to parse HGVS notations that use RefSeq transcripts as the reference sequence by using the --refseq flag when running the script. Such notations must include the version number of the transcript e.g.

NM_080794.3:c.1001C>T

where ".3" denotes that this is version 3 of the transcript NM_080794. See below for more details on how VEP can use RefSeq transcripts.


RefSeq transcripts

Ensembl produces databases containing alignments of RefSeq transcript objects to the reference genome, named otherfeatures databases. The otherfeatures databases are used to build the RefSeq cache, and merged with the standard Ensembl core database to produce the merged cache. These caches also contain alignments of CCDS transcripts and Ensembl EST sequences - they may be included in your analysis using --all_refseq.

By using the --refseq flag when running VEP, these alternative transcripts will be used as the reference for predicting variant consequences. Gene IDs given in the output when using this option are generally NCBI GeneIDs.

Users may wish to exclude predicted RefSeq transcripts (those with identifiers beginning with "XM_" or "XR_") by using --exclude_predicted.

Identifiers and other data

VEP's RefSeq cache lacks many classes of data present in the Ensembl transcript cache.

  • Included in the RefSeq cache
    • Gene symbol
    • SIFT and PolyPhen predictions
  • Not included in the RefSeq cache
    • APPRIS annotation
    • TSL annotation
    • Uniprot identifiers
    • CCDS identifiers
    • Protein domains
    • Gene-phenotype association data

Differences to the reference genome

Users should note that RefSeq sequences may differ from the reference genome sequence to which they are aligned. Ensembl's API (and hence VEP) constructs transcript models using the genomic reference sequence. For human cache files from release 90 or newer, differences are accounted for using BAM-edited transcript models. Prior to release 90, and in non-human species, differences between the RefSeq sequence and the genomic sequence were not accounted for, so the genomic sequence was used, meaning that some annotations produced by VEP on these transcripts may have been inaccurate. Most differences occur in non-coding regions, typically in UTRs at either end of transcripts or in the addition of a poly-A tail, meaning minimal impact on VEP's annotations.

For the GRCh38 VEP cache, each RefSeq transcript is annotated with the REFSEQ_MATCH flag indicating whether and how the RefSeq model differs from the underlying genome. Note that currently the REFSEQ_MATCH flag will not be set when using the GRCh37 cache.

Correcting transcript models with BAM files

NCBI have released BAM files that contain alignments of RefSeq transcripts to the genome. From release 90 onwards, these alignments have been incorporated and used to correct the transcript models in the RefSeq and merged cache files (for human GRCh37 and GRCh38 only).

VEP's cache building process uses the sequence and alignment in the BAM to correct the RefSeq model. If the corrected model does not match the original RefSeq sequence in the BAM, the corrected model is discarded. The success or failure of the BAM edit is recorded in the BAM_EDIT field of the VEP output. Failed edits are extremely rare (< 0.01% of transcripts), but any VEP annotations produced on transcripts with a failed edit status should be interpreted with extreme caution.

Using BAM-edited transcripts causes VEP to change how alleles are interpreted from input variants. Input variants are typically encoded in VCFs that are called using the reference genome. This means that the alternate (ALT) allele as given in the VCF may correspond to the reference allele as found in the corrected RefSeq transcript model. VEP will account for this, using the corrected reference allele (by enabling --use_transcript_ref) when calculating consequences, and the GIVEN_REF and USED_REF fields in the VEP output indicate any change made. If the reference allele derived from the transcript matches any given alternate (ALT) allele, then no consequence data will be produced for this allele as it will be considered non-variant. Note that this process may also clash with any interpretation from using --check_ref, so it is recommended to avoid using this flag.

To override the behaviour of --use_transcript_ref and force VEP to use your input reference allele instead of the one derived from the transcript, you may use --use_given_ref.

VEP can also side-load BAM files at runtime to correct transcript models on-the-fly; this may be used, for example, with pre-release 90 cache file or a RefSeq GFF files.

./vep -cache -refseq -i variants.vcf -bam interim_GRCh38.p10_knownrefseq_alignments_2017-01-13.bam

BAM files are available from NCBI:

NB:The BAM index files (.bai) in this directory are required and will need to be renamed as the perl library used to parse the files expects the index to be named [indexed_bam_file].bai:

mv interim_GRCh38.p10_knownrefseq_alignments_2017-01-13.bai interim_GRCh38.p10_knownrefseq_alignments_2017-01-13.bam.bai

Existing or colocated variants

VEP can use the --check_existing flag to identify known variants colocated with user input. VEP's known variant cache is derived from Ensembl's variation database and contains variants from dbSNP and other sources.

VEP by default uses a normalisation-based allele matching algorithm to identify known variants that match user input. Since both input and known variants may have multiple alternate (ALT) or variant alleles, each pair of reference (REF) and ALT alleles are normalised and compared independently to arrive at potential matches. VCF permits multiple allele types to be encoded on the same line, while dbSNP assigns separate rsID identifiers to different allele types at the same locus. This means different alleles from the same input variant may be assigned different known variant identifiers.


Illustration of VEP's allele matching algorithm resolving one VCF line with multiple ALTs to three different variant types and coordinates

Note that allele matching occurs independently of any allele transformations carried out by --minimal; VEP will match to the same identifiers and frequency data regardless of whether the flag is used.

For some data sources (COSMIC, HGMD), Ensembl is not licensed to redistribute allele-specific data, so VEP will report the existence of co-located variants with unknown alleles without carrying out allele matching. To disable this behaviour and exclude these variants, use the --exclude_null_alleles flag.

To disable allele matching completely and compare variant locations only, use --no_check_alleles.

Frequency data

In addition to identifying known variants, VEP also reports allele frequencies for input alleles from major genotyping projects (1000 genomes, ESP and gnomAD). VEP's cache currently contains only frequency data for alleles that have been submitted to dbSNP or are imported via another source into the Ensembl variation database. This means that until gnomAD's full data set is submitted to dbSNP and incorporated into Ensembl, the frequency for some alleles may be missing from VEP's cache data.

To access the full gnomAD data set, it is possible to use VEP's custom annotation feature to retrieve the frequency data directly from the gnomAD VCF files; see instructions here.