Variant Effect Predictor Other information
Species with multiple assemblies
With the arrival of GRCh38, Ensembl now supports two different assembly versions for the human genome while users transition from GRCh37. We provide a VEP cache download for the latest software version (78) for both assembly versions.
The VEP installer will install and set up the correct cache and FASTA file for your assembly of interest. If using the --AUTO functionality to install without prompts, remember to specify the required assembly version, e.g. "--ASSEMBLY GRCh37". It is also possible to have concurrent installations of caches from both assemblies; use the --assembly flag to select the correct one when you run the VEP script.
Once you have installed the relevant cache and FASTA file, you are then able to use the VEP as normal. For those using GRCh37 and requiring database access in addition to the cache (for example, to look up variant identifiers using --format id, see cache limitations), the script will warn you that you must change the database port in order to connect to the correct database:
ERROR: Cache assembly version (GRCh37) and database or selected assembly version (GRCh38) do not match
If using human GRCh37 add "--port 3337" to use the GRCh37 database, or --offline to avoid database connection entirely
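For those wrapping the VEP in a pipeline, the rule in this message can be captured in a small helper. The following is a minimal sketch and not part of the VEP itself: the function name `connection_flags` and its arguments are hypothetical, and only the --assembly, --port 3337 and --offline flags are taken from the text above.

```python
def connection_flags(assembly, need_database):
    """Return extra VEP command-line flags for a cache assembly (hypothetical helper)."""
    flags = ["--assembly", assembly]
    if not need_database:
        # No database lookups are needed, so avoid the connection entirely.
        flags.append("--offline")
    elif assembly == "GRCh37":
        # The GRCh37 database is reached on a different port.
        flags += ["--port", "3337"]
    return flags


print(" ".join(connection_flags("GRCh37", need_database=True)))
print(" ".join(connection_flags("GRCh38", need_database=False)))
```

The returned list can be appended to whatever command line you already use to run the VEP script.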
For users looking to move their data between assemblies, Ensembl provides an assembly converter tool - if you've downloaded the VEP, then you have it already! The script is found in the ensembl-tools/scripts/assembly_converter folder. There is also an online version of the tool available. Both UCSC (liftOver) and NCBI (Remap) also provide tools for converting data between assemblies.
By default the VEP is configured to provide annotation on every genomic feature that each input variant overlaps. This means that if a variant overlaps a gene with multiple alternate splicing variants (transcripts), then a block of annotation for each of these transcripts is reported in the output. In the default VEP output format each of these blocks is written on a single line of output; in VCF output format the blocks are separated by commas in the INFO field.
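To illustrate the VCF layout described above, the sketch below splits an INFO string into one block per overlapped feature. It assumes the annotation is stored under the key CSQ and that fields within a block are separated by "|"; check both against the header of your own output. The function name and the example identifiers are made up for illustration.

```python
def split_csq_blocks(info, key="CSQ"):
    """Split the VEP annotation in a VCF INFO string into per-feature blocks."""
    for entry in info.split(";"):
        name, _, value = entry.partition("=")
        if name == key:
            # One comma-separated block per overlapped feature;
            # fields within a block are assumed to be pipe-separated.
            return [block.split("|") for block in value.split(",")]
    return []


# Hypothetical INFO value with two transcript blocks (identifiers are made up).
info = "DP=20;CSQ=T|missense_variant|ENST_A,T|intron_variant|ENST_B"
for block in split_csq_blocks(info):
    print(block)
```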
For many users, however, this depth of annotation is not required, and to this end the VEP provides a number of options to reduce the amount of output produced. Which to choose depends on your motivations and requirements on the output.
NB: Wherever possible we would discourage users from summarising data in this way. Summarising inevitably involves data loss, and invariably at some point this will lead to the loss of biologically relevant information. For example, if your variant overlaps both a regulatory feature and a transcript and you use one of the flags below, the overlap with the regulatory feature will be lost in your output, when in some cases this may be a clue to the "real" functional effect of your variant. For these reasons we encourage users to use one of the flagging options (--flag_pick or --flag_pick_allele) and to post-filter results.
- --pick: this is the option we anticipate will be of use to most users. The VEP chooses one block of annotation per variant, using an ordered set of criteria. This order may be customised using --pick_order.
- --pick_allele: as above, but chooses one consequence block per variant allele. This can be useful for VCF input files with more than one ALT allele.
- --flag_pick: instead of choosing one block and removing the others, this option adds a flag "PICK=1" to the picked annotation block, allowing users to easily filter on this later using the VEP's filtering script.
- --flag_pick_allele: as above, but flags one block per allele.
- --per_gene: as --pick, but chooses one annotation block per gene that the input variant overlaps.
- --most_severe: this flag reports only the consequence type of the block with the highest rank, according to this table. Feature-specific annotation is absent from the output using this flag, so use with caution!
- --summary: this flag reports only a comma-separated list of the consequence types predicted for this variant. Feature-specific annotation is absent from the output using this flag, so use with caution!
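Post-filtering output produced with --flag_pick can be done with the VEP's filtering script; the sketch below only illustrates the idea for the default output format. It assumes the flag appears as "PICK=1" among semicolon-separated key=value pairs in the last column of each line, and the example lines are made up and shortened.

```python
def picked_lines(lines):
    """Yield data lines whose last column carries the PICK=1 flag."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        extra = line.rstrip("\n").split("\t")[-1]
        # The last column is assumed to hold key=value pairs separated by ';'.
        if "PICK=1" in extra.split(";"):
            yield line.rstrip("\n")


# Made-up, shortened example lines (real output has more columns).
example = [
    "#Uploaded_variation\tFeature\tConsequence\tExtra\n",
    "var1\tENST_A\tmissense_variant\tSTRAND=1;PICK=1\n",
    "var1\tENST_B\tintron_variant\tSTRAND=1\n",
]
for line in picked_lines(example):
    print(line)
```

Because the unflagged lines remain in the original file, nothing is lost: the full annotation can still be consulted when the picked block turns out not to tell the whole story.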
The VEP script supports HGVS notations as input. This feature is currently under development, and not all HGVS notation types are supported. Specifically, notations relative to genomic (g) or coding (c) sequences are currently supported; protein (p) notations are supported only in a limited fashion, because of the complexity involved in determining the multiple possible underlying genomic sequence changes that could produce a single protein change. The script will warn the user if it fails to parse a particular notation.
By default the VEP script uses Ensembl transcripts as its reference for determining consequences, and hence also for HGVS notations. However, it is possible to parse HGVS notations that use RefSeq transcripts as the reference sequence by using the --refseq flag when running the script. Such notations must include the version number of the transcript, e.g. a notation written against NM_080794.3, where ".3" denotes that this is version 3 of the transcript NM_080794. See below for more details on how the VEP can use RefSeq transcripts.
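The version requirement can be checked before input is passed to the script. The sketch below is a hypothetical pre-check, not the VEP's own parser: it only recognises the general shape reference[.version]:type.change with type g, c or p, and the variant description "c.1001C>T" in the usage lines is invented purely for illustration.

```python
import re

# General shape: reference[.version]:type.change, where type is g, c or p.
HGVS_RE = re.compile(
    r"^(?P<ref>[^:.\s]+)(?:\.(?P<version>\d+))?:(?P<type>[gcp])\.(?P<change>\S+)$"
)


def check_hgvs(notation, refseq=False):
    """Return (ok, reason) for a notation; a rough pre-check only."""
    match = HGVS_RE.match(notation)
    if match is None:
        return False, "unrecognised notation"
    if refseq and match.group("version") is None:
        # RefSeq-based notations need an explicit transcript version.
        return False, "RefSeq notations must include the transcript version"
    return True, "ok"


# The change "c.1001C>T" is a made-up example, not taken from the documentation.
print(check_hgvs("NM_080794.3:c.1001C>T", refseq=True))
print(check_hgvs("NM_080794:c.1001C>T", refseq=True))
```

A notation that passes this check may still be rejected by the VEP, which will warn if it fails to parse it.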
Ensembl produces Core schema databases containing alignments of RefSeq transcript objects to the reference genome. This is the otherfeatures database, and is produced for human and mouse. The database also contains alignments of CCDS transcripts and Ensembl EST sequences - they may be included in your analysis using --all_refseq. By passing the --refseq flag when running the VEP script, these alternative transcripts will be used as the reference for predicting variant consequences. Gene IDs given in the output when using this option are generally NCBI GeneIDs.
Users should note that RefSeq sequences may disagree with the reference sequence to which they are aligned, hence results generated when using this option should be interpreted with a degree of caution. A much more complex and stringent process is used to produce the main Ensembl Core database, and this should be used in preference to the RefSeq transcripts.
SIFT and PolyPhen predictions and scores are now calculated and referred to internally using the translated sequence, so predictions are available using the --refseq flag where the RefSeq translation matches the Ensembl translation (they will match in the vast majority of cases - most differences between Ensembl and RefSeq transcripts occur in non-coding regions).
The VEP script can be used to convert files between the various formats that it parses. This may be useful for a user with, for example, a number of variants given in HGVS notation against RefSeq transcript identifiers. The conversion process allows these notations to be converted into genomic reference coordinates, and then used to predict consequences in the VEP against Ensembl transcripts.