Variant Effect Predictor Caches and databases
The VEP script can use a variety of data sources to retrieve transcript information that is used to predict consequence types.
Using a local cache is the most efficient way to use the VEP; we would encourage users to use the cache wherever possible. Caches are easy to download and set up using the installer. Follow the tutorial for a simple guide.
Using the cache
Using the cache (--cache) is the fastest and most efficient way to use the VEP script, as in most cases only a single initial network connection is made and most data is read from local disk. Use offline mode to eliminate all network connections.
Cache files are compressed using the gzip utility. By default zcat is used to decompress the files, although gzcat or gzip itself can also be used; you must have one of these utilities installed in your path to use the cache. Use --compress [command] to change the default.
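Before a long run it can be useful to confirm that one of these utilities is actually on your path. The following is a minimal sketch, not part of the VEP itself: the function name find_decompressor is hypothetical, and the use of "gzip -dc" as the stdout-writing equivalent of zcat is an assumption about how you would pass gzip to --compress.

```python
import shutil

# Candidate commands in the order the text mentions them.
# "gzip -dc" decompresses to standard output, like zcat (assumed equivalent).
CANDIDATES = [["zcat"], ["gzcat"], ["gzip", "-dc"]]


def find_decompressor(which=shutil.which):
    """Return the first available command as a list of argv tokens, or None.

    `which` is injectable so the lookup can be exercised without touching
    the real PATH.
    """
    for argv in CANDIDATES:
        if which(argv[0]):
            return argv
    return None


if __name__ == "__main__":
    found = find_decompressor()
    if found is None:
        print("No zcat, gzcat or gzip found in PATH")
    else:
        print("Could pass to --compress:", " ".join(found))
```

If the result is not zcat, the joined command is what you would supply to --compress.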
The easiest solution is to download a pre-built cache for your species; this eliminates the need to connect to the database while the script is running (except when using certain options). Cache files can either be downloaded and unpacked as described here, or automatically downloaded and configured using the installer script.
Users interested in RefSeq transcripts may download an alternate cache file (e.g. homo_sapiens_refseq), or a merged file of RefSeq and Ensembl transcripts (e.g. homo_sapiens_merged); remember to specify --refseq or --merged when running the VEP to use the relevant cache.
Cache files for popular species:
- Human (Homo sapiens) - GRCh37
- Human (Homo sapiens) - GRCh38
- Mouse (Mus musculus) - GRCm38
- Zebrafish (Danio rerio) - Zv9
NB: When using Ensembl Genomes caches, you should use the --cache_version option to specify the relevant Ensembl Genomes version number as these differ from the concurrent Ensembl/VEP version numbers.
Instructions for use
- Download the archive file for your species
- Extract the archive in your cache directory. By default the VEP uses $HOME/.vep/ as the cache directory, where $HOME is your UNIX home directory.
mv homo_sapiens_vep_80.tar.gz ~/.vep/
cd ~/.vep/
tar xfz homo_sapiens_vep_80.tar.gz
- Run the VEP with the --cache option
Caches for several species, and indeed different Ensembl releases of the same species, can be stored in the same cache base directory. The files are stored in the following directory hierarchy: $HOME -> .vep -> species -> version -> chromosome
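The hierarchy described above can be sketched as a small path helper, for example to check whether a given species and release is already installed. This is only an illustration: cache_dir and chromosome_dir are hypothetical names, and the layout is taken directly from the description above rather than from the VEP source.

```python
from pathlib import Path


def cache_dir(species, version, root=None):
    """Directory holding one species/release, e.g. ~/.vep/homo_sapiens/80."""
    base = Path(root) if root is not None else Path.home() / ".vep"
    return base / species / str(version)


def chromosome_dir(species, version, chromosome, root=None):
    """Directory holding the cache files for a single chromosome."""
    return cache_dir(species, version, root) / str(chromosome)


if __name__ == "__main__":
    target = cache_dir("homo_sapiens", 80)
    state = "present" if target.is_dir() else "not installed"
    print(f"{target}: {state}")
```

Passing root mirrors the effect of pointing the VEP at a non-default cache base directory.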
If a pre-built cache does not exist for your species, please contact the ensembl-dev mailing list and we will endeavour to add your species to the list of downloads.
It is also possible to build your own cache from a GTF file and a FASTA file.
It is possible to use any combination of cache and database; when using the cache, the cache will take precedence, with the database being used when the relevant data is not found in the cache.
Building a cache from a GTF file
For species that don't have a publicly available cache, or even an Ensembl core database, it is possible to build a VEP cache using the gtf2vep.pl script included alongside the main script. This requires a GTF file and a FASTA file containing the reference sequence for the same species. The GTF file must have its features sorted in chromosomal order, and must contain the following entry types and data:
- "exon" feature lines must have "transcript_id", "gene_id" and "exon_number" in the description field
- "CDS" feature lines must correspond to coding exons, with "transcript_id" and "exon_number" populated in the description field, and the "frame" column populated with the reading frame of the exon
- the "source" column of the GTF should indicate the biotype of the exon/CDS - this should be "protein_coding" for protein coding transcripts
For examples of these, see the GTF files made available by Ensembl on the Downloads page
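The requirements listed above can be checked before running gtf2vep.pl. The sketch below is an assumption-laden illustration rather than the script's own validation: it assumes the standard nine tab-separated GTF columns, treats the first token of each semicolon-separated attribute as its key, and does not check sort order or the biotype in the "source" column. The names check_gtf_line and attribute_keys are hypothetical.

```python
# Attributes each feature type must carry, as listed in the text above.
REQUIRED_ATTRIBUTES = {
    "exon": {"transcript_id", "gene_id", "exon_number"},
    "CDS": {"transcript_id", "exon_number"},
}


def attribute_keys(attributes):
    """Collect attribute names from the ninth GTF column."""
    keys = set()
    for field in attributes.split(";"):
        tokens = field.split()
        if tokens:
            keys.add(tokens[0])
    return keys


def check_gtf_line(line):
    """Return a list of problems for one GTF line (empty list means OK)."""
    if not line.strip() or line.startswith("#"):
        return []  # blank lines and comments are ignored
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return [f"expected 9 tab-separated columns, found {len(columns)}"]
    feature, frame, attributes = columns[2], columns[7], columns[8]
    if feature not in REQUIRED_ATTRIBUTES:
        return []  # other feature types are not constrained here
    missing = sorted(REQUIRED_ATTRIBUTES[feature] - attribute_keys(attributes))
    problems = [f"{feature} line is missing {name}" for name in missing]
    if feature == "CDS" and frame not in ("0", "1", "2"):
        problems.append("CDS line has no valid frame (expected 0, 1 or 2)")
    return problems
```

Running check_gtf_line over every line of the file and printing any non-empty result gives a quick report of lines that would need fixing.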
The FASTA file should contain sequence for all values of the seqname column in the GTF file. The first time you run the script, the Bio::DB::Fasta module will create an index for the file that allows rapid random access to the sequences corresponding to the transcripts described in the GTF file; this may take a few minutes to create.
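A quick way to confirm that the FASTA file covers every seqname in the GTF file is to compare the two sets of names. The sketch below assumes that the sequence identifier is the first whitespace-delimited word after ">" on each FASTA header line; the function names are hypothetical, and both inputs are any iterables of text lines (for example open file handles).

```python
def fasta_ids(fasta_lines):
    """Sequence identifiers: first word after '>' on each header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def gtf_seqnames(gtf_lines):
    """Values of the first (seqname) column, skipping comments and blanks."""
    names = set()
    for line in gtf_lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def missing_sequences(gtf_lines, fasta_lines):
    """Seqnames used in the GTF that have no sequence in the FASTA."""
    return sorted(gtf_seqnames(gtf_lines) - fasta_ids(fasta_lines))
```

An empty result means every seqname in the GTF has a matching FASTA entry under the stated assumption about header parsing.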
The script is run as follows:
perl gtf2vep.pl -i my_species_genes.gtf -f my_species_seq.fa -d 80 -s my_species
perl variant_effect_predictor.pl -offline -i my_species_variants.vcf -s my_species
By default the cache is created in $HOME/.vep/[species]/[version]/ - to change this root directory, use --dir.
This process takes around 15-20 minutes for human (including the time taken to index the FASTA file). Note that caches created in this way can only be used in offline mode.
Using FASTA files
By pointing the VEP to a FASTA file (or directory containing several files), it is possible to retrieve reference sequence locally when using --cache or --offline. This enables the VEP to retrieve HGVS notations (--hgvs) and check the reference sequence given in input data (--check_ref) without accessing a database.
FASTA files can be set up using the installer; files set up using the installer are automatically detected by the VEP when using --cache or --offline; you should not need to use --fasta to manually specify them.
To enable this the VEP uses the Bio::DB::Fasta module. The first time you run the script with a specific FASTA file, an index will be built. This can take a few minutes, depending on the size of the FASTA file and the speed of your system. On subsequent runs the index does not need to be rebuilt (if the FASTA file has been modified, the VEP will force a rebuild of the index).
Ensembl provides suitable reference FASTA files as downloads from its FTP server. See the Downloads page for details. In most cases it is best to download the single large "primary_assembly" file for your species. You should use the unmasked (without "_rm" or "_sm" in the name) sequences. Note that the VEP requires that the file be unzipped to run; when unzipped these files can be very large (25GB for human). An example set of commands for setting up the data for human follows:
wget ftp://ftp.ensembl.org/pub/release-80/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
perl variant_effect_predictor.pl -i input.vcf --offline --hgvs --fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa
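The two file requirements mentioned above (unmasked sequence, decompressed file) can be caught from the file name alone before handing it to --fasta. This is a name-based heuristic only, under the assumption that the file follows the naming described in the text; fasta_name_problems is a hypothetical helper, and it does not inspect the file contents.

```python
def fasta_name_problems(filename):
    """Return reasons why a FASTA file name looks unsuitable for --fasta."""
    problems = []
    if filename.endswith(".gz"):
        problems.append("still gzip-compressed; decompress it first")
    if "_rm" in filename or "_sm" in filename:
        problems.append("masked sequence; use the unmasked file")
    return problems


if __name__ == "__main__":
    for name in (
        "Homo_sapiens.GRCh38.dna.primary_assembly.fa",
        "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz",
    ):
        print(name, "->", fasta_name_problems(name) or "looks fine")
```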
Convert with tabix
For users with tabix installed on their systems, the speed of retrieving existing co-located variants can be greatly improved by converting the cache files using the supplied script, convert_cache.pl. This replaces the plain-text, chunked variant dumps with a single tabix-indexed file per chromosome. The script is simple to run:
perl convert_cache.pl -species [species] -version [vep_version]
To convert all species and all versions, use "all":
perl convert_cache.pl -species all -version all
A full description of the options can be seen using --help. When complete, the VEP should automatically detect the converted cache and use it in place of the unconverted files. Note that tabix and bgzip must be installed on your system to use a converted cache.
Limitations of the cache
The cache stores the following information:
- Transcript location, sequence, exons and other attributes
- Gene, protein and HGNC identifiers for each transcript (where applicable)
- Location and alleles of existing variations
- Regulatory regions
- Predictions and scores for SIFT, PolyPhen
It does not store any information pertaining to, and therefore cannot be used for, the following:
- Frequency filtering of input (--check_frequency) on populations not included in the cache. The human cache currently includes frequency data for the combined 1000 Genomes phase 1 population (ALL), the continental level populations (AFR, AMR, ASN, EUR), and the two NHLBI-ESP populations (AA, EA). It does not contain frequencies for national level (e.g. CEU, YRI) populations.
- HGVS names (--hgvs) - to retrieve these you must additionally point to a FASTA file containing the reference sequence for your species (--fasta)
- Using HGVS notation as input (--format hgvs)
- Using variant identifiers as input (--format id)
- Finding overlapping structural variants (--check_sv)
Enabling one of these options with --cache will cause the script to warn you in its status output with something like the following:
2011-06-16 16:24:51 - INFO: Database will be accessed when using --hgvs
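The limitations listed above can be expressed as a simple pre-flight check that reports which requested options would need data that the cache does not hold. This is a sketch built only from the list in this section, not a reproduction of the VEP's own logic; the function name, its parameters and the wording of the returned reasons are all hypothetical.

```python
# Populations with frequency data in the human cache, as listed above.
CACHED_POPULATIONS = {"ALL", "AFR", "AMR", "ASN", "EUR", "AA", "EA"}


def options_needing_database(hgvs=False, fasta=False, input_format=None,
                             check_sv=False, freq_population=None):
    """List the requested options that the cache alone cannot satisfy."""
    reasons = []
    if hgvs and not fasta:
        # HGVS names need reference sequence: from --fasta or the database.
        reasons.append("--hgvs without --fasta")
    if input_format in ("hgvs", "id"):
        reasons.append(f"--format {input_format}")
    if check_sv:
        reasons.append("--check_sv")
    if freq_population is not None and freq_population not in CACHED_POPULATIONS:
        reasons.append(f"--check_frequency on {freq_population}")
    return reasons
```

For example, options_needing_database(hgvs=True, fasta=True) returns an empty list, while requesting frequency filtering on CEU would be reported.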
When using the public database servers, the VEP script requests transcript and variation data that overlap the loci in your input file. As such, these coordinates are transmitted over the network to a public server, which may not be suitable for those with sensitive or private data. Users should note that only the coordinates are transmitted to the server; no other information is sent.
By using a full downloaded cache (preferably in offline mode) or a local database, it is possible to avoid completely any network connections to public servers, thus preserving absolutely the privacy of your data.
It is possible to run the VEP in an offline mode that does not use the database, and does not require a standard installation of the Ensembl API. This means users require only perl (version 5.8 or greater) and one of the zcat, gzcat or gzip utilities. To enable this mode, use the flag --offline.
The simplest way to set up your system is to use the installer script, INSTALL.pl. This will download the required dependencies to your system, and download and set up any cache files that you require.
Public database servers
By default, the script is configured to connect to Ensembl's public MySQL instance at ensembldb.ensembl.org. For users in the US (or for any user geographically closer to the East coast of the USA than to Ensembl's data centre in Cambridge, UK), a mirror server is available at useastdb.ensembl.org. To use the mirror, use the flag --host useastdb.ensembl.org
Users of Ensembl Genomes species (e.g. plants, fungi, microbes) should use their public MySQL instance; the connection parameters for this can be automatically loaded by using the flag --genomes
Users with small data sets (100s of variants) should find using the default connection settings adequate. Those with larger data sets, or those who wish to use the script in a batch manner, should consider one of the alternatives below.
Using a local database
It is possible to set up a local MySQL mirror with the databases for your species of interest installed. For instructions on installing a local mirror, see here. You will need a MySQL server that you can connect to from the machine where you will run the script (this can be the same machine). For most of the functionality of the VEP, you will only need the Core database (e.g. homo_sapiens_core_80_38) installed. In order to find co-located variations or to use SIFT or PolyPhen, it is also necessary to install the relevant variation database (e.g. homo_sapiens_variation_80_38).
Note that unless you have custom data to insert in the database, in most cases it will be much more efficient to use a pre-built cache in place of a local database.
To connect to your mirror, you can either set the connection parameters using --host, --port, --user and --password, or use a registry file. Registry files contain all the connection parameters for your database, as well as any species aliases you wish to set up:
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "core",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_core_80_38'
);

Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "variation",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_variation_80_38'
);

Bio::EnsEMBL::Registry->add_alias("Homo_sapiens","human");
For more information on the registry and registry files, see here.
Building your own cache
It is possible to build your own cache using the VEP script. You should NOT use this command when connected to the public MySQL instances - the process takes a long time, meaning the connection can break unexpectedly and you will be violating Ensembl's reasonable use policy on the public servers. You should either download one of the pre-built caches, or create a local copy of your database of interest to build the cache from.
You may wish to build a full cache if you have a custom Ensembl database with data not found on the public servers, or you may wish to create a minimal cache covering only a certain set of chromosome regions. Cache files are compressed using the gzip utility; this must be installed in your path to write cache files.
To build a cache "on-the-fly", use the --cache and --write_cache flags when you run the VEP with your input. Only cache files overlapping your input variants will be created; the next time you run the script with this cache, the data will be read from the cache instead of the database. Any data not found in the cache will be read from the database (and then written to the cache if --write_cache is enabled). If your data covers a relatively small proportion of your genome of interest (for example, a few genes of interest), it can be OK to use the public MySQL servers when building a partial cache.
perl variant_effect_predictor.pl -cache -dir /my/cache/dir/ -write_cache -i input.txt
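The read path described above (cache first, database as fallback, optional write-back) can be sketched as follows. This is an illustration of the behaviour as described in the text, not the VEP's implementation: fetch_region is a hypothetical name, the cache is modelled as a plain dictionary, the database as any callable, and the region keys are made up for the example.

```python
def fetch_region(region, cache, database=None, write_cache=False):
    """Return (data, source) for a region, preferring the cache.

    cache       -- dict mapping region keys to data
    database    -- callable taking a region key, or None when offline
    write_cache -- store database results in the cache, like --write_cache
    """
    if region in cache:
        return cache[region], "cache"
    if database is None:
        raise LookupError(f"{region} is not cached and no database is available")
    data = database(region)
    if write_cache:
        cache[region] = data
    return data, "database"


if __name__ == "__main__":
    cache = {}
    lookup = lambda region: f"transcripts for {region}"
    print(fetch_region("21:1-1000000", cache, lookup, write_cache=True))
    # The second request for the same region is served from the cache.
    print(fetch_region("21:1-1000000", cache, lookup))
```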
perl variant_effect_predictor.pl -host dbhost -user username -pass password -port 3306 -build 21 -dir /my/cache/dir/
[Figure panels: Normal mode, Cache mode, Build mode]
ADVANCED The cache consists of compressed files containing listrefs of serialised objects. These objects are initially created from the database as if using the Ensembl API normally. In order to reduce the size of the cache and allow the serialisation to occur, some changes are made to the objects before they are dumped to disk. This means that they will not behave in exactly the same way as an object retrieved from the database when writing, for example, a plugin that uses the cache.
The following hash keys are deleted from each transcript object:
- dbentries : this contains the external references retrieved when calling $transcript->get_all_DBEntries(); hence this call on a cached object will return no entries
- transcript_mapper : used to convert between genomic, cdna, cds and protein coordinates. A copy of this is cached separately by the VEP under the "mapper" key of the "_variation_effect_feature_cache" hash described below
As mentioned above, a special hash key "_variation_effect_feature_cache" is created on the transcript object and used to cache things used by the VEP in predicting consequences, things which might otherwise have to be fetched from the database. Some of these are stored in place of equivalent keys that are deleted as described above. The following keys and data are stored:
- introns : listref of intron objects for the transcript. The adaptor, analysis, dbID, next, prev and seqname keys are stripped from each intron object
- translateable_seq : as returned by
- mapper : transcript mapper as described above
- peptide : the translated sequence as a string, as returned by
- protein_features : protein domains for the transcript's translation as returned by $transcript->translation->get_all_ProteinFeatures. Each protein feature is stripped of all keys but: start, end, analysis, hseqname
- codon_table : the codon table ID used to translate the transcript,
as returned by
- protein_function_predictions : a hashref containing the keys "sift"
and "polyphen"; each one contains a protein function prediction matrix
as returned by e.g.
Similarly, some further data is cached directly on the transcript object under the following keys:
- _gene : gene object. This object has all keys but the following deleted: start, end, strand, stable_id
- _gene_symbol : the gene symbol
- _ccds : the CCDS identifier for the transcript
- _refseq : the "NM" RefSeq mRNA identifier for the transcript
- _protein : the Ensembl stable identifier of the translation
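For plugin authors it may help to see the stripping rules above applied to plain dictionaries. The sketch below models transcript, gene, intron and protein feature objects as Python dicts purely for illustration; the real objects are Perl hashes created by the Ensembl API, and slim_transcript and its helpers are hypothetical names, not VEP functions. Only the key lists come from the text.

```python
# Key lists taken from the description above.
TRANSCRIPT_DROPPED = ("dbentries", "transcript_mapper")
INTRON_DROPPED = ("adaptor", "analysis", "dbID", "next", "prev", "seqname")
PROTEIN_FEATURE_KEPT = ("start", "end", "analysis", "hseqname")
GENE_KEPT = ("start", "end", "strand", "stable_id")


def without(obj, keys):
    """Copy of obj with the given keys removed."""
    return {k: v for k, v in obj.items() if k not in keys}


def only(obj, keys):
    """Copy of obj restricted to the given keys."""
    return {k: obj[k] for k in keys if k in obj}


def slim_transcript(transcript, gene, introns, protein_features):
    """Build a cache-style transcript from dict stand-ins for API objects."""
    slim = without(transcript, TRANSCRIPT_DROPPED)
    vf_cache = {
        "introns": [without(i, INTRON_DROPPED) for i in introns],
        "protein_features": [only(p, PROTEIN_FEATURE_KEPT) for p in protein_features],
    }
    if "transcript_mapper" in transcript:
        # The mapper is kept, but under the VEP's own cache key.
        vf_cache["mapper"] = transcript["transcript_mapper"]
    slim["_variation_effect_feature_cache"] = vf_cache
    slim["_gene"] = only(gene, GENE_KEPT)
    return slim
```

The practical consequence for a plugin is that anything outside these retained keys has to be treated as absent when the transcript comes from the cache.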