EnsemblEnsembl Home

Variant Effect Predictor Annotation sources


VEP can use a variety of annotation sources to retrieve the transcript models used to predict consequence types.

  • Cache - a downloadable file containing all transcript models, regulatory features and variant data for a species
  • GFF or GTF - use transcript models defined in a tabix-indexed GFF or GTF file
  • Database - connect to a MySQL database server hosting Ensembl databases

Data from VCF, BED and bigWig files can also be incorporated by VEP's custom annotation feature.

Using a cache is the most efficient way to use VEP; we would encourage you to use a cache wherever possible. Caches are easy to download and set up using the installer. Follow the tutorial for a simple guide.

Caches

Using a cache (--cache) is the fastest and most efficient way to use VEP, as in most cases only a single initial network connection is made and most data is read from local disk. Use offline mode to eliminate all network connections for speed and/or privacy.

Downloading caches

Ensembl creates cache files for every species for each Ensembl release. They can be automatically downloaded and configured using INSTALL.pl.

If interested in RefSeq transcripts you may download an alternate cache file (e.g. homo_sapiens_refseq), or a merged file of RefSeq and Ensembl transcripts (eg homo_sapiens_merged); remember to specify --refseq or --merged when running VEP to use the relevant cache. See documentation for full details.

Manually downloading caches

It is also simple to download and set up caches without using the installer. By default, VEP searches for caches in $HOME/.vep; to use a different directory when running VEP, use --dir_cache.

cd $HOME/.vep
curl -O ftp://ftp.ensembl.org/pub/release-90/variation/VEP/homo_sapiens_vep_90_GRCh38.tar.gz
tar xzf homo_sapiens_vep_90_GRCh38.tar.gz

FTP directories by species grouping:

Limitations of the cache

The cache stores the following information:

  • Transcript location, sequence, exons and other attributes
  • Gene, protein, HGNC and other identifiers for each transcript (where applicable, limitations apply to RefSeq caches)
  • Locations, alleles and frequencies of existing variants
  • Regulatory regions
  • Predictions and scores for SIFT, PolyPhen

It does not store any information pertaining to, and therefore cannot be used for, the following:

  • HGVS names (--hgvs, --hgvsg) - to retrieve these you must additionally point to a FASTA file containing the reference sequence for your species (--fasta)
  • Using HGVS notation as input (--format hgvs)
  • Using variant identifiers as input (--format id)
  • Finding overlapping structural variants (--check_sv)

Enabling one of these options with --cache will cause VEP to warn you in its status output with something like the following:

 2011-06-16 16:24:51 - INFO: Database will be accessed when using --hgvs 

Convert with tabix

For those with Bio::DB::HTS (as set up by INSTALL.pl) or tabix installed on their systems, the speed of retrieving existing co-located variants can be greatly improved by converting the cache files using the supplied script, convert_cache.pl. This replaces the plain-text, chunked variant dumps with a single tabix-indexed file per chromosome. The script is simple to run:

perl convert_cache.pl -species [species] -version [vep_version]

To convert all species and all versions, use "all":

perl convert_cache.pl -species all -version all

A full description of the options can be seen using --help. When complete, VEP will automatically detect the converted cache and use this in place.

Note that tabix and bgzip must be installed on your system to convert a cache. INSTALL.pl downloads these when setting up Bio::DB::HTS; to enable convert_cache.pl to find them, run:

export PATH=${PATH}:${PWD}/htslib

Data privacy and offline mode

When using the public database servers, VEP requests transcript and variation data that overlap the loci in your input file. As such, these coordinates are transmitted over the network to a public server, which may not be appropriate for those with sensitive or private data. Users should note that only the coordinates are transmitted to the server; no other information is sent.

To run VEP in an offline mode that does not use any network connections, use the flag --offline.

The limitations described above apply absolutely when using offline mode. For example, if you specify --offline and --format id, VEP will report an error and refuse to run:

ERROR: Cannot use ID format in offline mode

All other features, including the ability to use custom annotations and plugins, are accessible in offline mode.


GFF/GTF files

VEP can use transcript annotations defined in GFF or GTF files. The files must be bgzipped and indexed with tabix, and VEP requires a FASTA file containing the genomic sequence in order to generate transcript models.

Your GFF or GTF file must be sorted in chromosomal order. VEP does not use header lines so it is safe to remove them.

grep -v "#" data.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > data.gff.gz
tabix -p gff data.gff.gz
./vep -i input.vcf -gff data.gff.gz -fasta genome.fa.gz

You may use any number of GFF/GTF files in this way, providing they refer to the same genome. You may also use them in concert with annotations from a cache or database source; annotations are distinguished by the SOURCE field in the VEP output:

./vep -i input.vcf -cache -gff data.gff.gz -fasta genome.fa.gz

This functionality uses VEP's custom annotation feature, and the --gff flag is a shortcut to:

--custom data.gff.gz,,gff

You should use the longer form if you wish to customise the name of the GFF as it appears in the SOURCE field and VEP output header.

GFF format expectations

VEP has been tested on GFF files generated by Ensembl and NCBI (RefSeq). Due to inconsistency in the GFF specification and adherence to it, VEP may encounter problems parsing some GFF files. For the same reason, not all transcript biotypes defined in your GFF may be supported by VEP. VEP does not support GFF files with embedded FASTA sequence.

The following entity types (3rd column in the GFF) are supported by VEP. Lines of other types will be ignored; if this leads to an incomplete transcript model, the whole transcript model may be discarded. [Show supported types]

Entities in the GFF are expected to be linked using a key named "parent" or "Parent" in the attributes (9th) column of the GFF. Unlinked entities (i.e. those with no parents or children) are discarded. Sibling entities (those that share the same parent) may have overlapping coordinates, e.g. for exon and CDS entities.

Transcripts require a Sequence Ontology biotype to be defined in order to be parsed by VEP. The simplest way to define this is using an attribute named "biotype" on the transcript entity. Other configurations are supported in order for VEP to be able to parse GFF files from NCBI and other sources.

GTF format expectations

The following GTF entity types will be parsed by VEP:

  • cds (or CDS)
  • stop_codon
  • exon
  • gene
  • transcript

Entities are linked by an attribute named for the parent entity type e.g. exon is linked to transcript by transcript_id, transcript is linked to gene by gene_id.

Transcript biotypes are defined in attributes named "biotype", "transcript_biotype" or "transcript_type". If none of these exist, VEP will attempt to interpret the source field (2nd column) of the GTF as the biotype.

Chromosome synonyms

If the chromosome names used in your GFF/GTF differ from those used in the FASTA or your input VCF, you may see warnings like this when running VEP:

WARNING: Chromosome 21 not found in annotation sources or synonyms on line 160

To circumvent this you may provide VEP with a synonyms file. A synonym file is included in VEP's cache files, so if you have one of these for your species you can use it as follows:

./vep -i input.vcf -cache -gff data.gff.gz -fasta genome.fa.gz -synonyms ~/.vep/homo_sapiens/90_GRCh38/chr_synonyms.txt

Limitations

Using a GFF or GTF file as VEP's annotation source limits access to some auxiliary information available when using a cache. Currently most external reference data such as gene symbols, transcript identifiers and protein domains are inaccessible when using only a GFF/GTF file.

VEP's flexibility does allow some annotation types to be replaced. The following table illustrates some examples and alternative means to retrieve equivalent data.

Data type Alternative
SIFT and PolyPhen predictions (--sift, --polyphen) Use the PolyPhen_SIFT VEP plugin
Co-located variants (--check_existing, --af* flags) A couple of options are available:
  1. Use a VCF with --custom to retrieve variant IDs, frequency and other data
  2. Add --cache to use variants in the cache. *
Regulatory consequences (--regulatory) Add --cache to use regulatory features in the cache. *

* Note this will also instruct VEP to annotate input variants against transcript models retrieved from the cache as well as those from the GFF/GTF file. It is possible to use --transcript_filter to include only the transcripts from your GFF/GTF file:

./vep -i input.vcf -cache -custom data.gff.gz,myGFF,gff -fasta genome.fa.gz -transcript_filter "_source_cache is myGFF"

FASTA files

By pointing VEP to a FASTA file (or directory containing several files), it is possible to retrieve reference sequence locally when using --cache or --offline. This enables VEP to retrieve HGVS notations (--hgvs), check the reference sequence given in input data (--check_ref), and construct transcript models from a GFF or GTF file without accessing a database.

FASTA files can be set up using the installer; files set up using the installer are automatically detected by VEP when using --cache or --offline; you should not need to use --fasta to manually specify them.

To enable this VEP uses one of two modules:

  • The Bio::DB::HTS Perl XS module with HTSlib. This module uses compiled C code and can access compressed (bgzipped) or uncompressed FASTA files. It is set up by the VEP installer.
  • The Bio::DB::Fasta module. This may be used on systems where installation of the Bio::DB::HTS module has not been possible. It can access only uncompressed FASTA files. It is also set up by the VEP installer and comes as part of the BioPerl package.

The first time you run VEP with a specific FASTA file, an index will be built. This can take a few minutes, depending on the size of the FASTA file and the speed of your system. On subsequent runs the index does not need to be rebuilt (if the FASTA file has been modified, VEP will force a rebuild of the index).

Ensembl provides suitable reference FASTA files as downloads from its FTP server. See the Downloads page for details. You should preferably use the installer as described above to fetch these files; manual instructions are provided for reference. In most cases it is best to download the single large "primary_assembly" file for your species. You should use the unmasked (without "_rm" or "_sm" in the name) sequences. Note that VEP requires that the file be either unzipped (Bio::DB::Fasta) or unzipped and then recompressed with bgzip (Bio::DB::HTS::Faidx) to run; when unzipped these files can be very large (25GB for human). An example set of commands for setting up the data for human follows:

curl -O ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
bgzip Homo_sapiens.GRCh38.dna.primary_assembly.fa
./vep -i input.vcf --offline --hgvs --fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Databases

VEP can use remote or local database servers to retrieve annotations.

  • Using --cache (without --offline) uses the local cache on disk to fetch most annotations, but allows database connections for some features (see cache limitations)
  • Using --database tells VEP to retrieve all annotations from the database. Please only use this for small input files or when using a local database server!

Public database servers

By default, VEP is configured to connect to Ensembl's public MySQL instance at ensembldb.ensembl.org. For users in the US (or for any user geographically closer to the East coast of the USA than to Ensembl's data centre in Cambridge, UK), a mirror server is available at useastdb.ensembl.org. To use the mirror, use the flag --host useastdb.ensembl.org

Users of Ensembl Genomes species (e.g. plants, fungi, microbes) should use their public MySQL instance; the connection parameters for this can be automatically loaded by using the flag --genomes

Users with small data sets (100s of variants) should find using the default connection settings adequate. Those with larger data sets, or those who wish to use VEP in a batch manner, should consider one of the alternatives below.

Using a local database

It is possible to set up a local MySQL mirror with the databases for your species of interest installed. For instructions on installing a local mirror, see here. You will need a MySQL server that you can connect to from the machine where you will run VEP (this can be the same machine). For most of the functionality of VEP, you will only need the Core database (e.g. homo_sapiens_core_90_38) installed. In order to find co-located variants or to use SIFT or PolyPhen, it is also necessary to install the relevant variation database (e.g. homo_sapiens_variation_90_38).

Note that unless you have custom data to insert in the database, in most cases it will be much more efficient to use a pre-built cache in place of a local database.

To connect to your mirror, you can either set the connection parameters using --host, --port, --user and --password, or use a registry file. Registry files contain all the connection parameters for your database, as well as any species aliases you wish to set up:

use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "core",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_core_90_38'
);

Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "variation",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_variation_90_38'
);

Bio::EnsEMBL::Registry->add_alias("Homo_sapiens","human");
	

For more information on the registry and registry files, see here.


Cache - technical information

ADVANCED The cache consists of compressed files containing listrefs of serialised objects. These objects are initially created from the database as if using the Ensembl API normally. In order to reduce the size of the cache and allow the serialisation to occur, some changes are made to the objects before they are dumped to disk. This means that they will not behave in exactly the same way as an object retrieved from the database when writing, for example, a plugin that uses the cache.

The following hash keys are deleted from each transcript object:

  • analysis
  • created_date
  • dbentries : this contains the external references retrieved when calling $transcript->get_all_DBEntries(); hence this call on a cached object will return no entries
  • description
  • display_xref
  • edits_enabled
  • external_db
  • external_display_name
  • external_name
  • external_status
  • is_current
  • modified_date
  • status
  • transcript_mapper : used to convert between genomic, cdna, cds and protein coordinates. A copy of this is cached separately by VEP as
    $transcript->{_variation_effect_feature_cache}->{mapper}

As mentioned above, a special hash key "_variation_effect_feature_cache" is created on the transcript object and used to cache things used by VEP in predicting consequences, things which might otherwise have to be fetched from the database. Some of these are stored in place of equivalent keys that are deleted as described above. The following keys and data are stored:

  • introns : listref of intron objects for the transcript. The adaptor, analysis, dbID, next, prev and seqname keys are stripped from each intron object
  • translateable_seq : as returned by
    $transcript->translateable_seq
  • mapper : transcript mapper as described above
  • peptide : the translated sequence as a string, as returned by
    $transcript->translate->seq
  • protein_features : protein domains for the transcript's translation as returned by
    $transcript->translation->get_all_ProteinFeatures
    Each protein feature is stripped of all keys but: start, end, analysis, hseqname
  • codon_table : the codon table ID used to translate the transcript, as returned by
    $transcript->slice->get_all_Attributes('codon_table')->[0]
  • protein_function_predictions : a hashref containing the keys "sift" and "polyphen"; each one contains a protein function prediction matrix as returned by e.g.
    $protein_function_prediction_matrix_adaptor->fetch_by_analysis_translation_md5('sift', md5_hex($transcript-{_variation_effect_feature_cache}->{peptide}))

Similarly, some further data is cached directly on the transcript object under the following keys:

  • _gene : gene object. This object has all keys but the following deleted: start, end, strand, stable_id
  • _gene_symbol : the gene symbol
  • _ccds : the CCDS identifier for the transcript
  • _refseq : the "NM" RefSeq mRNA identifier for the transcript
  • _protein : the Ensembl stable identifier of the translation
  • _source_cache : the source of the transcript object. Only defined in the merged cache (values: Ensembl, RefSeq) or when using a GFF/GTF file (value: short name or filename)