EnsemblEnsembl Home

Variant Effect Predictor Tutorial


Basic tutorial

NOTE: If you're using a UNIX or Mac system, you can dive straight into this tutorial by opening your favourite terminal application. If you're on Windows you might like to have a look at the guide for Windows users before starting.

Have you downloaded the VEP yet? Click this link to download if you haven't already. If you prefer to do things on the terminal, you can use wget to grab the VEP archive:

> wget "https://github.com/Ensembl/ensembl-tools/archive/release/77.zip"

If you clicked the download link, let's find the file and unpack it ourselves. Typically the file will be in your Downloads folder:

> cd ~/Downloads

Now let's unpack it and navigate to the VEP's sub-directory:

> unzip ensembl-tools-release-77.zip
> cd ensembl-tools-release-77/scripts/variant_effect_predictor/

NOTE: depending on your setup the archive file may instead be named 77.zip

The VEP uses "cache files" or a remote database to read genomic data. Using cache files gives the best performance - let's set one up using the installer:

> perl INSTALL.pl
Hello! This installer is configured to install v77 of the Ensembl API for use by the VEP.
It will not affect any existing installations of the Ensembl API that you may have.

It will also download and install cache files from Ensembl's FTP server.

Checking for installed versions of the Ensembl API...done
It looks like you already have v77 of the API installed.
You shouldn't need to install the API

Skip to the next step (n) to install cache files

Do you want to continue installing the API (y/n)?

If you haven't yet installed the API, type "y" followed by enter, otherwise type "n" (perhaps if you ran the installer before). At the next prompt, type "y" to install cache files

Do you want to continue installing the API (y/n)? n
 - skipping API installation

The VEP can either connect to remote or local databases, or use local cache files.
Cache files will be stored in /nfs/users/nfs_w/wm2/.vep
Do you want to install any cache files (y/n)? y

Downloading list of available cache files
The following species/files are available; which do you want (can specify multiple separated by spaces):
1 : ailuropoda_melanoleuca_vep_77_ailMel1.tar.gz
2 : anas_platyrhynchos_vep_77_BGI_duck_1.0.tar.gz
3 : anolis_carolinensis_vep_77_AnoCar2.0.tar.gz
...
24 : homo_sapiens_vep_77_GRCh38.tar.gz
...

?

Type "24" (or the relevant number for homo_sapiens and GRCh38) to install the cache for the latest human assembly. This will take a little while to download and unpack! By default the VEP assumes you are working in human; it's easy to switch to any other species using --species [species].

? 24
 - downloading ftp://ftp.ensembl.org/pub/release-77/variation/VEP/homo_sapiens_vep_77_GRCh38.tar.gz
 - unpacking homo_sapiens_vep_77_GRCh38.tar.gz

Success

By default the VEP installs cache files in a folder in your home area ($HOME/.vep); you can easily change this using the -d flag when running the install script. Have a look at the installer documentation for more details.

The VEP needs some input containing variant positions to run. In their most basic form, this should just be a chromosomal location and a pair of alleles (reference and alternate). The VEP can also use common formats such as VCF and pileup as input; these are commonly produced by variant calling packages. Have a look at the Data formats page for more information.

We can now use our cache file to run the VEP on the supplied example file example_GRCh38.vcf, which is a VCF file containing variants from the 1000 Genomes Project exon-sequencing pilot, remapped to GRCh38:

> perl variant_effect_predictor.pl -i example_GRCh38.vcf --cache
2013-07-31 09:17:54 - Read existing cache info
2013-07-31 09:17:54 - Starting...
ERROR: Output file variant_effect_output.txt already exists. Specify a different output file
with --output_file or overwrite existing file with --force_overwrite

You may see this error message if you've already run the script once. The VEP tries not to trample over your existing files unless you tell it to. So let's tell it to using --force_overwrite

> perl variant_effect_predictor.pl -i example_GRCh38.vcf --cache --force_overwrite
2013-07-31 09:19:30 - Read existing cache info
2013-07-31 09:19:30 - Starting...
2013-07-31 09:19:30 - Detected format of input file as vcf
2013-07-31 09:19:30 - Read 173 variants into buffer
2013-07-31 09:19:30 - Reading transcript data from cache and/or database
[================================================================================]  [ 100% ]
2013-07-31 09:19:31 - Retrieved 3037 transcripts (0 mem, 3106 cached, 0 DB, 69 duplicates)
2013-07-31 09:19:31 - Analyzing chromosome 21
2013-07-31 09:19:31 - Analyzing variants
[================================================================================]  [ 100% ]
2013-07-31 09:19:32 - Calculating consequences
[================================================================================]  [ 100% ]
2013-07-31 09:19:32 - Analyzing chromosome 22
2013-07-31 09:19:32 - Analyzing variants
[================================================================================]  [ 100% ]
2013-07-31 09:19:33 - Calculating consequences
[================================================================================]  [ 100% ]
2013-07-31 09:19:34 - Processed 173 total variants (43 vars/sec, 43 vars/sec total)
2013-07-31 09:19:34 - Wrote stats summary to variant_effect_output.txt_summary.html
2013-07-31 09:19:34 - Finished!

Note that the VEP has told us it retrieved 3037 transcripts from the cache - this means the cache we downloaded is working OK.

The VEP produces status output as it runs to give you an idea of the progress it is making. You can silence this output using --quiet, though note that errors and warnings will still be printed out.

By default the VEP writes to a file named "variant_effect_output.txt" - you can change this file name using -o. Let's have a look at the output that the script generated.

> head variant_effect_output.txt
## ENSEMBL VARIANT EFFECT PREDICTOR v77
## Output produced at xxxx-xx-xx 09:19:30
## Connected to homo_sapiens_core_77_38 on ensembldb.ensembl.org
## Using cache in /nfs/users/nfs_w/wm2/.vep/homo_sapiens/77
## Using API version 77, DB version 77
## Extra column keys:
## DISTANCE : Shortest distance from variant to transcript
#Uploaded_variation  Location      Allele  Gene            Feature         Feature_type    Consequence ...
rs116645811          21:25587758   A       ENSG00000260583 ENST00000567517 Transcript      upstream_gene_variant ...
rs116645811          21:25587758   A       ENSG00000154719 ENST00000352957 Transcript      intron_variant ...

The lines starting with "#" are header or meta information lines. The final one of these (highlighted in blue above) gives the column names for the data that follows. To see more information about the VEP's output format, see the Data formats page.

We can see two lines of output here, both for the uploaded variant named rs116645811. In many cases, a variant will fall in more than one transcript. Typically this is where a single gene has multiple splicing variants; however, as in this case, it can also be that the variant overlaps transcripts from two different genes. Here our variant has a consequence for the genes ENSG00000260583 and ENSG00000154719.

In the consequence column, we can see two different terms, upstream_gene_variant and intron_variant. These terms are standardised terms for describing the effects of sequence variants on genomic features, produced by the Sequence Ontology (SO). See our predicted data page for a guide to the consequence types that the VEP and Ensembl uses.

Let's try something a little more interesting. SIFT is an algorithm for predicting whether a given change in a protein sequence will be deleterious to the function of that protein. The VEP can give SIFT predictions for any of the missense variants that it predicts. To do this, simply add --sift b (the b means we want both the prediction and the score):

> perl variant_effect_predictor.pl -i example_GRCh38.vcf --cache --force_overwrite --sift b

SIFT calls variants either "deleterious" or "tolerated". We can use the VEP's filtering script to find only those that SIFT considers deleterious:

> perl filter_vep.pl -i variant_effect_output.txt -filter "SIFT is deleterious" | head -n15
...
## SIFT : SIFT prediction
#Uploaded_variation     Location        Allele  Gene            Feature         ...  Extra
rs114053718             21:32656885     G       ENSG00000159082 ENST00000382499 ...  SIFT=deleterious(0)
rs114053718             21:32656885     G       ENSG00000159082 ENST00000382491 ...  SIFT=deleterious(0)
rs114053718             21:32656885     G       ENSG00000159082 ENST00000322229 ...  SIFT=deleterious(0)
rs114053718             21:32656885     G       ENSG00000159082 ENST00000433931 ...  SIFT=deleterious(0)

Note that the SIFT score appears in the "Extra" column, as a key/value pair. This column can contain multiple key/value pairs depending on the options you give to the VEP script. See the Data formats page for more information on the fields in the Extra column.

You can also configure how the VEP writes its output using the --fields flag.

You'll also see that we have multiple results for the same gene, ENSG00000159082. Let's say we're only interested in what is considered the canonical transcript for this gene (--canonical), and that we want to know what the commonly used gene symbol from HGNC is for this gene (--symbol). We can also use a UNIX pipe to pass the output from the VEP directly into the filtering script:

> perl variant_effect_predictor.pl -i example_GRCh38.vcf --cache --force_overwrite --sift b --canonical --symbol \
--fields Uploaded_variation,SYMBOL,CANONICAL,SIFT -o STDOUT | \
perl filter_vep.pl -format vep -filter "CANONICAL is YES and SIFT is deleterious"
...
#Uploaded_variation  SYMBOL    CANONICAL  SIFT
rs114053718          SYNJ1     YES        deleterious(0)
rs2254562            SYNJ1     YES        deleterious(0.04)
rs4986958            IFNGR2    YES        deleterious(0.03)
rs16994704           PIGP      YES        deleterious(0.01)
...
rs115264708          PHF21B    YES        deleterious(0.04)

So now we can see all of the variants that have a deleterious effect on canonical transcripts, and the symbol for their genes. Nice!


Over to you!

This has been just a short introduction to the capabilities of the VEP - have a look through some more of the options, see them all on the command line using --help, or try using the shortcut --everything which switches on almost all available output fields! Try out the different options in the filtering script, and if you're feeling adventurous why not use some of your own data to annotate your variants or have a go with a plugin or two.