Variant Effect Predictor: Other information
Species with multiple assemblies
With the arrival of GRCh38, Ensembl now supports two assembly versions of the human genome while users transition from GRCh37. We provide VEP cache downloads for both assemblies on the latest software version (76).
The VEP installer will install and set up the correct cache and FASTA file for your assembly of interest. If you use the --AUTO functionality to install without prompts, remember to specify the required assembly version, e.g. "--ASSEMBLY GRCh37". Caches from both assemblies can also be installed side by side; just use the --assembly flag to select the correct one when you run the VEP script.
Once you have installed the relevant cache and FASTA file, you can use the VEP as normal. If you are using GRCh37 and require database access in addition to the cache (for example, to look up variant identifiers using --format id; see cache limitations), the script will warn you that you must change the database port in order to connect to the correct database:
ERROR: Cache assembly version (GRCh37) and database or selected assembly version (GRCh38) do not match
If using human GRCh37 add "--port 3337" to use the GRCh37 database, or --offline to avoid database connection entirely
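The port rule in the message above can be captured in a small helper. The following is a minimal sketch that only builds an argument list and does not run anything. The script name variant_effect_predictor.pl, the -i input option and the --cache flag do not appear in the text above and are assumptions here; the function name and file name are hypothetical.

```python
def vep_args(input_file, assembly, use_database=False):
    """Build (but do not run) an argument list for the VEP script.

    Assumptions: the script is called variant_effect_predictor.pl and
    accepts -i and --cache; only --assembly, --port and --offline are
    taken from the text above.
    """
    args = ["perl", "variant_effect_predictor.pl",
            "-i", input_file,
            "--assembly", assembly]
    if not use_database:
        # Cache only: avoid any database connection.
        args.append("--offline")
        return args
    # Cache plus database access.
    args.append("--cache")
    if assembly == "GRCh37":
        # The GRCh37 database is reached on a different port.
        args += ["--port", "3337"]
    return args


if __name__ == "__main__":
    print(" ".join(vep_args("variants.txt", "GRCh37", use_database=True)))
```

Keeping the rule in one place means a GRCh37 run with database access cannot forget the port, while a GRCh38 run never receives it.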
For users looking to move their data between assemblies, Ensembl provides an assembly converter tool; if you have downloaded the VEP, then you already have it. The script is found in the ensembl-tools/scripts/assembly_converter folder, and an online version of the tool is also available. UCSC (liftOver) and NCBI (Remap) also provide tools for converting data between assemblies.
HGVS notations
The VEP script supports HGVS notations as input. This feature is currently under development, and not all HGVS notation types are supported. Specifically, notations relative to genomic (g) or coding (c) sequences are supported, while protein (p) notations are supported only in a limited fashion, because a single protein change can be produced by several different underlying genomic sequence changes and determining them is complex. The script will warn the user if it fails to parse a particular notation.
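As a rough pre-check before submitting a file, the notation type can be read from the one-letter prefix that follows the colon. The sketch below is not the VEP's own parser; it only mirrors the support levels described above. The identifiers in the usage lines are made up for illustration.

```python
import re

# A reference identifier, a colon, then a one-letter sequence type and a dot,
# e.g. "<transcript>:c.". Everything after the dot is ignored here.
HGVS_PREFIX = re.compile(r"^(?P<ref>[^:\s]+):(?P<kind>[a-z])\.")

# Support levels as described in the text: g and c supported, p limited.
SUPPORT = {"g": "supported", "c": "supported", "p": "limited"}


def hgvs_support(notation):
    """Return a coarse support level for an HGVS-style string."""
    match = HGVS_PREFIX.match(notation.strip())
    if match is None:
        return "unparseable"
    return SUPPORT.get(match.group("kind"), "unsupported")


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    for text in ["ENST00000000001:c.10A>G", "ENSP00000000001:p.Ala10Gly"]:
        print(text, "->", hgvs_support(text))
```

A check like this only looks at the prefix, so a notation it reports as supported may still be rejected by the script for other reasons.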
By default the VEP script uses Ensembl transcripts as its reference for determining consequences, and hence also for HGVS notations. However, it is possible to parse HGVS notations that use RefSeq transcripts as the reference sequence by passing the --refseq flag when running the script. Such notations must include the version number of the transcript, e.g. a notation written against NM_080794.3, where ".3" denotes that this is version 3 of the transcript NM_080794. See below for more details on how the VEP can use RefSeq transcripts.
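Because a missing version number is easy to overlook, it can help to check RefSeq-based notations before running the script. This is a minimal sketch, assuming RefSeq accessions of the form two capital letters, an underscore and digits; the variant part of the example ("c.100A>G") is invented for illustration and is not taken from the documentation.

```python
import re

# RefSeq-style accession with an explicit version, followed by a colon,
# e.g. "NM_080794.3:". Notations without ".<version>" do not match.
REFSEQ_VERSIONED = re.compile(r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+):")


def refseq_version(notation):
    """Return the transcript version as an int, or None if it is missing."""
    match = REFSEQ_VERSIONED.match(notation.strip())
    return int(match.group("version")) if match else None


if __name__ == "__main__":
    # The "c.100A>G" change is a made-up example.
    print(refseq_version("NM_080794.3:c.100A>G"))
    print(refseq_version("NM_080794:c.100A>G"))
```

Notations for which the function returns None would need a version number added before they can be used with --refseq.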
RefSeq transcripts
Ensembl produces Core schema databases, known as otherfeatures databases, containing alignments of RefSeq transcript objects to the reference genome. These are produced for human and mouse. They also contain alignments of CCDS transcripts and Ensembl EST sequences, which may be included in your analysis using --all_refseq. When the --refseq flag is passed to the VEP script, these alternative transcripts are used as the reference for predicting variant consequences. Gene IDs given in the output when using this option are generally NCBI GeneIDs.
Users should note that RefSeq sequences may disagree with the reference sequence to which they are aligned, hence results generated when using this option should be interpreted with a degree of caution. A much more complex and stringent process is used to produce the main Ensembl Core database, and this should be used in preference to the RefSeq transcripts.
SIFT and PolyPhen predictions and scores are now calculated and referred to internally using the translated sequence, so predictions are available with the --refseq flag wherever the RefSeq translation matches the Ensembl translation. The two match in the vast majority of cases, as most differences between Ensembl and RefSeq transcripts occur in non-coding regions.
Using the VEP to convert formats
The VEP script can be used to convert files between the various formats that it parses. This may be useful, for example, to a user with a number of variants given in HGVS notation against RefSeq transcript identifiers. The conversion process turns these notations into genomic reference coordinates, which can then be used to predict consequences in the VEP against Ensembl transcripts.
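The two-step workflow described above starts with a conversion run. The following sketch builds the argument list for that step without executing it. It assumes that this release of the script offers a --convert option taking an output format such as vcf, and that the script is called variant_effect_predictor.pl with -i and -o for input and output files. None of these appear in the text above, so check them against the options documentation for your version. The file names are hypothetical.

```python
def convert_args(input_file, output_file, out_format="vcf", refseq=False):
    """Build (but do not run) an argument list for a format conversion.

    Assumptions: the script name, -i/-o and --convert are not taken from
    the text above; only --refseq is.
    """
    args = ["perl", "variant_effect_predictor.pl",
            "-i", input_file,
            "-o", output_file,
            "--convert", out_format]
    if refseq:
        # Needed when the input HGVS notations are written against RefSeq transcripts.
        args.append("--refseq")
    return args


if __name__ == "__main__":
    print(" ".join(convert_args("hgvs_refseq.txt", "converted.vcf", refseq=True)))
```

The converted file, now in genomic coordinates, could then be passed to a normal VEP run without --refseq to obtain consequences against Ensembl transcripts.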