Variant Effect Predictor: Running the VEP


The VEP script is run on the command line as follows:

 perl variant_effect_predictor.pl [options] 

where [options] represents a set of flags and options to the script. These can be listed using the flag --help:

 perl variant_effect_predictor.pl --help 

Users should download a cache file for their species of interest, using either the installer script or by following the documentation, and run the VEP with either the --cache or --offline option.

For smaller input files, it is possible for the VEP to connect to Ensembl's public database servers in place of the cache; to enable this, use --database.

Most users will need only a few of the options described below; for most, the following command is enough to get started:

 perl variant_effect_predictor.pl --cache -i input.txt -o output.txt 

where input.txt contains data in one of the compatible input formats, and output.txt is the output file created by the script. See Data Formats for more detail on input and output formats.

Options can be passed as the full string (e.g. --format), or as the shortest unique string among the options (e.g. --form for --format, since there is another option --force_overwrite).
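
The abbreviation rule can be sketched in shell as follows. This illustrates the idea only and is not the script's own option parser; the options list is a hand-picked subset of the flags documented on this page, and shortest_unique is a hypothetical helper name.

```shell
# Illustrative subset of option names that start with "f".
options="fasta failed fields filter fork force_overwrite format"

# Print the shortest prefix of $1 that matches exactly one option name.
shortest_unique() {
  name=$1
  len=1
  while [ "$len" -lt "${#name}" ]; do
    prefix=$(printf '%s' "$name" | cut -c1-"$len")
    matches=0
    for opt in $options; do
      case $opt in "$prefix"*) matches=$((matches + 1)) ;; esac
    done
    if [ "$matches" -eq 1 ]; then
      echo "$prefix"
      return 0
    fi
    len=$((len + 1))
  done
  echo "$name"   # no shorter unique prefix exists; use the full name
}

shortest_unique format   # prints: form
```

With this subset, "fo" and "for" each still match fork, force_overwrite and format, so --form is the first unambiguous abbreviation.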

Options can also be read from a configuration file: either implicitly from $HOME/.vep/vep.ini, or explicitly from a file named with --config.
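
As a minimal sketch of the explicit route, the snippet below writes a small config file and prints a command that would load it. The file name my_vep.ini, the input file name and the chosen settings are placeholders for illustration; the option names come from the tables on this page.

```shell
# Write an illustrative config file: one whitespace-separated
# "option value" pair per line. my_vep.ini is a placeholder name.
cat > my_vep.ini <<'EOF'
output_file   my_output.txt
species       homo_sapiens
format        vcf
EOF

# Print the command that would load the file. Options given on the
# command line (here --cache and -i) override those in the file.
vep_cmd="perl variant_effect_predictor.pl --cache --config my_vep.ini -i input.vcf"
echo "$vep_cmd"
```

Saving the same file as $HOME/.vep/vep.ini instead would make the script pick it up without --config.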


Basic options

Flag Alternate Description
--help
  Display help message and quit
--verbose
-v
Output longer status messages as the script runs. This option can be used to generate the basis of a configuration file - see --config below. Not used by default
--quiet
-q
Suppress status and warning messages. Not used by default
--no_progress
  Don't show progress bars. Progress bars shown by default
--config [filename]
  Load configuration options from a config file. The config file should consist of whitespace-separated pairs of option names and settings e.g.:
output_file   my_output.txt
species       mus_musculus
format        vcf
host          useastdb.ensembl.org
A config file can also be read implicitly; save the file as $HOME/.vep/vep.ini (or the equivalent directory if using --dir). Any options in this file will be overridden by those specified in a config file using --config, and in turn by any options manually specified on the command line. You can quickly create such a file by setting the flags as normal and running the script in verbose (-v) mode. This will output lines that can be copied to a config file and loaded on the next run using --config. Not used by default
--everything
  Shortcut flag to switch on all of the following:

--sift b, --polyphen b, --ccds, --uniprot, --hgvs, --symbol, --numbers, --domains, --regulatory, --canonical, --protein, --biotype, --gmaf, --maf_1kg, --maf_esp, --pubmed

--fork [num_forks]
  Enable forking, using the specified number of forks. Forking can dramatically improve the runtime of the script. Not used by default

Input options

Flag Alternate Description
--species [species]
  Species for your data. This can be the Latin name e.g. "homo_sapiens" or any Ensembl alias e.g. "mouse". Specifying the Latin name can speed up initial database connection as the registry does not have to load all available database aliases on the server. Default = "homo_sapiens"
--assembly [name]
  Select the assembly version to use if more than one is available. If using the cache, you must have the appropriate assembly's cache file installed. If not specified and you have only one assembly version installed, this will be chosen by default. Default = use found assembly version
--input_file [filename]
-i
Input file name. If not specified, the script will attempt to read from STDIN.
--format [format]
  Input file format - one of "ensembl", "vcf", "pileup", "hgvs", "id", "vep". By default, the script auto-detects the input file format. Using this option you can force the script to read the input file as Ensembl, VCF, pileup or HGVS format, a list of variant identifiers (e.g. rsIDs from dbSNP), or the output from the VEP (e.g. to add custom annotation to an existing results file using --custom). Auto-detects format by default
--output_file [filename]
-o
Output file name. The script can write to STDOUT by specifying STDOUT as the output file name - this will force quiet mode. Default = "variant_effect_output.txt"
--force_overwrite
--force
By default, the script will fail with an error if the output file already exists. You can force the overwrite of the existing file by using this flag. Not used by default
--stats_file [filename]
  Summary stats file name. This is an HTML file containing a summary of the VEP run - the file name must end in ".htm" or ".html". Default = "variant_effect_output.txt_summary.html"
--no_stats
  Don't generate a stats file.
--stats_text
  Generate a plain text stats file in place of the HTML.
--html
  Generate an additional HTML version of the output file containing hyperlinks to Ensembl and other resources. File name of this file is [output_file].html
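
The STDIN and STDOUT behaviour of --input_file and --output_file lets the script sit in the middle of a pipeline. The sketch below only assembles and prints such a pipeline for review; the file names are placeholders, and it assumes the input is a gzip-compressed VCF file.

```shell
# Omitting -i makes the script read STDIN; -o STDOUT sends results to
# STDOUT and forces quiet mode. File names are placeholders.
vep_cmd="perl variant_effect_predictor.pl --cache --format vcf -o STDOUT"

# Print the pipeline for review rather than running it here.
pipeline="gzip -dc variants.vcf.gz | $vep_cmd | gzip -c > results.txt.gz"
echo "$pipeline"
```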

Cache options

Flag Alternate Description
--cache
  Enables use of the cache. Add --refseq to use the RefSeq cache (if installed).
--dir [directory]
  Specify the base cache/plugin directory to use. Default = "$HOME/.vep/"
--dir_cache [directory]
  Specify the cache directory to use. Default = "$HOME/.vep/"
--dir_plugins [directory]
  Specify the plugin directory to use. Default = "$HOME/.vep/"
--offline
  Enable offline mode. No database connections will be made, and only a complete cache (either downloaded or built using --build) can be used for this mode. Add --refseq to use the RefSeq cache (if installed). Not used by default
--fasta [file|dir]
  Specify a FASTA file or a directory containing FASTA files to use to look up reference sequence. The first time you run the script with this parameter an index will be built which can take a few minutes. This is required if fetching HGVS annotations (--hgvs) or checking reference sequences (--check_ref) in offline mode (--offline), and optional with some performance increase in cache mode (--cache). See documentation for more details. Not used by default
--cache_version
  Use a different cache version than the assumed default (the VEP version). This should be used with Ensembl Genomes caches since their version numbers do not match Ensembl versions. For example, the VEP/Ensembl version may be 74 and the Ensembl Genomes version 21. Not used by default
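
Because HGVS output in offline mode depends on a FASTA file, a wrapper can check for the file before asking for --hgvs. The following is a sketch under stated assumptions: reference.fa, input.vcf and output.txt are placeholder names, and the command is printed rather than executed.

```shell
# Placeholder path; point this at your own FASTA file or directory.
fasta="reference.fa"

vep_args="--offline -i input.vcf -o output.txt"
if [ -e "$fasta" ]; then
  # HGVS output in offline mode needs a reference sequence.
  vep_args="$vep_args --hgvs --fasta $fasta"
else
  echo "No FASTA at $fasta: leaving out --hgvs" >&2
fi

echo "perl variant_effect_predictor.pl $vep_args"
```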

Output options

Flag Alternate Description
--sift [p|s|b]
  Species limited: SIFT predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids. The VEP can output the prediction term, score or both. Not used by default
--polyphen [p|s|b]
--poly
Human only: PolyPhen is a tool which predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations. The VEP can output the prediction term, score or both. The VEP uses the humVar score by default - use --humdiv to retrieve the humDiv score. Not used by default
--humdiv
  Human only: Retrieve the humDiv PolyPhen prediction instead of the default humVar. Not used by default
--regulatory
  Look for overlaps with regulatory regions. The script can also report whether a variant falls in a high information position within a transcription factor binding site. Output lines have a Feature type of RegulatoryFeature or MotifFeature. Not used by default
--cell_type
  Report only regulatory regions that are found in the given cell type(s). Can be a single cell type or a comma-separated list. The functional type in each cell type is reported under CELL_TYPE in the output. To retrieve a list of cell types, use --cell_type list. Not used by default
--custom [filename]
  Add custom annotation to the output. Files must be tabix indexed or in the bigWig format. Multiple files can be specified by supplying the --custom flag multiple times. See here for full details. Not used by default
--plugin [plugin name]
  Use named plugin. Plugin modules should be installed in the Plugins subdirectory of the VEP cache directory (defaults to $HOME/.vep/). Multiple plugins can be used by supplying the --plugin flag multiple times. See plugin documentation. Not used by default
--individual [all|ind list]
  Consider only alternate alleles present in the genotypes of the specified individual(s). May be a single individual, a comma-separated list or "all" to assess all individuals separately. Individual variant combinations homozygous for the given reference allele will not be reported. Each individual and variant combination is given on a separate line of output. Only works with VCF files containing individual genotype data; individual IDs are taken from column headers. Not used by default
--phased
  Force VCF genotypes to be interpreted as phased. For use with plugins that depend on phased data. Not used by default
--allele_number
  Identify allele number from VCF input, where 1 = first ALT allele, 2 = second ALT allele etc. Not used by default
--total_length
  Give cDNA, CDS and protein positions as Position/Length. Not used by default
--numbers
  Adds affected exon and intron numbering to the output. Format is Number/Total. Not used by default
--domains
  Adds names of overlapping protein domains to output. Not used by default
--no_escape
  Don't URI escape HGVS strings. Default = escape
--terms [ensembl|so]
-t
The type of consequence terms to output. The Ensembl terms are described here. The Sequence Ontology is a joint effort by genome annotation centres to standardise descriptions of biological sequences. Default = "SO"

Identifiers

Flag Alternate Description
--hgvs
  Add HGVS nomenclature based on Ensembl stable identifiers to the output. Both coding and protein sequence names are added where appropriate. To generate HGVS identifiers when using --cache or --offline you must use a FASTA file and --fasta. Not used by default
--protein
  Add the Ensembl protein identifier to the output where appropriate. Not used by default
--symbol
  Adds the gene symbol (e.g. HGNC) (where available) to the output. Not used by default
--ccds
  Adds the CCDS transcript identifier (where available) to the output. Not used by default
--uniprot
  Adds identifiers for translated protein products from three UniProt-related databases (SWISSPROT, TREMBL and UniParc) to the output. Not used by default
--canonical
  Adds a flag indicating if the transcript is the canonical transcript for the gene. Not used by default
--biotype
  Adds the biotype of the transcript. Not used by default
--xref_refseq
  Output aligned RefSeq mRNA identifier for transcript. NB: the RefSeq and Ensembl transcripts aligned in this way MAY NOT, AND FREQUENTLY WILL NOT, match exactly in sequence, exon structure and protein product. Not used by default

Co-located variants

Flag Alternate Description
--check_existing
  Checks for the existence of known variants that are co-located with your input. By default the alleles are not compared - to do so, use --check_alleles. Not used by default
--check_alleles
  When checking for existing variants, only report a co-located variant if none of the alleles supplied are novel. For example, if the user input has alleles A/G, and an existing co-located variant has alleles A/C, the co-located variant will not be reported.

Strand is also taken into account - in the same example, if the user input has alleles T/G but on the negative strand, then the co-located variant will be reported since its alleles match the reverse complement of user input. Not used by default
--check_svs
  Checks for the existence of structural variants that overlap your input. Currently requires database access. Not used by default
--gmaf
  Add the global minor allele frequency (MAF) from 1000 Genomes Phase 1 data for any existing variant to the output. Not used by default
--maf_1kg
  Add allele frequency from continental populations (AFR,AMR,ASN,EUR) of 1000 Genomes Phase 1 to the output. Note the reported allele(s) and frequencies are for the non-reference allele from the original data, not necessarily the alternate allele from user input. Must be used with --cache. Not used by default
--maf_esp
  Include allele frequency from NHLBI-ESP populations. Note the reported allele(s) and frequencies are for the non-reference allele from the original data, not necessarily the alternate allele from user input. Must be used with --cache. Not used by default
--old_maf
  For --maf_1kg and --maf_esp report only the frequency (no allele) and convert this frequency so it is always a minor frequency, i.e. < 0.5
--pubmed
  Report PubMed IDs for publications that cite existing variant. Must be used with --cache. Not used by default
--failed
  When checking for co-located variants, by default the script will exclude variants that have been flagged as failed. Set this flag to include such variants. Exclude by default
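
The strand handling described for --check_alleles amounts to comparing alleles after reverse-complementing them. The helper below is a hypothetical illustration of that step, not code taken from the VEP; it reproduces the T/G example given above.

```shell
# Reverse-complement each allele in a "/"-separated allele string.
# Characters other than A, C, G and T (e.g. "-") are passed through.
revcomp_alleles() {
  printf '%s\n' "$1" | awk -F/ '
    BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
    {
      out = ""
      for (i = 1; i <= NF; i++) {
        rc = ""
        for (j = length($i); j >= 1; j--) {
          b = substr($i, j, 1)
          rc = rc ((b in comp) ? comp[b] : b)
        }
        out = out (i > 1 ? "/" : "") rc
      }
      print out
    }'
}

revcomp_alleles "T/G"   # prints: A/C
```

Input alleles T/G on the negative strand therefore correspond to A/C on the positive strand, which is why they match an existing A/C variant.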

Data format options

Flag Alternate Description
--vcf
  Writes output in VCF format. Consequences are added in the INFO field of the VCF file, using the key "CSQ". Data fields are separated by "|"; the order of fields is written in the VCF header. Output fields can be selected by using --fields.

If the input format was VCF, the file will remain unchanged save for the addition of the CSQ field (unless using any filtering).

Custom data added with --custom are added as separate fields, using the key specified for each data file.

Commas in fields are replaced with ampersands (&) to preserve VCF format.

Not used by default
--json
  Writes output in JSON format. Not used by default
--gvf
  Writes output in GVF format. Not used by default
--original
  Writes output as a filtered set of the input. Must be used with --filter. Input lines are unchanged - consequences are calculated but not written to the output. Not used by default
--fields [list]
  Configure the output format using a comma separated list of fields. Fields may be those present in the default output columns, or any of those that appear in the Extra column (including those added by plugins or custom annotations). Output remains tab-delimited. Not used by default
--convert [format]
  Converts the input file to the specified format (one of "ensembl", "vcf", "pileup"). See documentation for more details. Converted output is written to the file specified with --output_file. No consequence prediction is carried out. Not used by default
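
To show how the CSQ encoding written by --vcf can be unpacked, here is a small sketch using standard text tools. The INFO value is made up for illustration, including its identifiers and field order; in real output the field order must be read from the VCF header.

```shell
# Made-up INFO value; identifiers and field order are illustrative only.
info='DP=14;CSQ=G|ENSG00000000001|ENST00000000001|Transcript|missense_variant&splice_region_variant'

# Extract the value of the CSQ key from the ";"-separated INFO field.
csq=$(printf '%s\n' "$info" | tr ';' '\n' | sed -n 's/^CSQ=//p')

# Split the data fields on "|" and turn "&" back into ",".
printf '%s\n' "$csq" | tr '|' '\n' | sed 's/&/,/g'
```

This prints one field per line, with the last line reading missense_variant,splice_region_variant.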

Filtering and QC options

NOTE: The filtering options here filter your results before they are written to your output file. Using the VEP's filtering script, it is possible to filter your results after the VEP has run. This way you can retain all of the results and run multiple filter sets on the same results to find different data of interest.

Flag Alternate Description
--check_ref
  Force the script to check the supplied reference allele against the sequence stored in the Ensembl Core database. Lines that do not match are skipped. Not used by default
--coding_only
  Only return consequences that fall in the coding regions of transcripts. Not used by default
--chr [list]
  Select a subset of chromosomes to analyse from your file. Any data not on these chromosomes in the input will be skipped. The list can be comma separated, with "-" characters representing an interval. For example, to include chromosomes 1, 2, 3, 10 and X you could use --chr 1-3,10,X. Not used by default
--no_intergenic
  Do not include intergenic consequences in the output. Not used by default
--pick
  Pick one line of consequence data per variant, including transcript-specific columns. Consequences are chosen by the canonical, biotype status and length of the transcript, along with the ranking of the consequence type according to this table. This is the best method to use if you are interested only in one consequence per variant. Not used by default
--pick_allele
  Like --pick, but chooses one line of consequence data per variant allele. Will only differ in behaviour from --pick when the input variant has multiple alternate alleles. Not used by default
--most_severe
  Output only the most severe consequence per variation. Transcript-specific columns will be left blank. Consequence ranks are given in this table. Not used by default
--summary
  Output only a comma-separated list of all observed consequences per variation. Transcript-specific columns will be left blank. Not used by default
--per_gene
  Output only the most severe consequence per gene. The transcript selected is arbitrary if more than one has the same predicted consequence. Consequence ranks are given in this table. Not used by default
--filter_common
  Shortcut flag for the filters below - this will exclude variants that have a co-located existing variant with global MAF > 0.01 (1%). May be modified using any of the following freq_* filters. For human, this can be used in offline mode for the following populations: 1KG_ALL, 1KG_AFR, 1KG_AMR, 1KG_ASN, 1KG_EUR. Not used by default
--check_frequency
  Turns on frequency filtering. Use this to include or exclude variants based on the frequency of co-located existing variants in the Ensembl Variation database. You must also specify all of the --freq flags below. Using this option requires a database connection - while it can be used with --cache, the database will still be accessed to retrieve frequency data. Frequencies used in filtering are added to the output under the FREQS key in the Extra field. Not used by default
--freq_pop [pop]
  Name of the population to use in frequency filter. This can be the name of the population as it appears on the Ensembl website (suitable for most species), or in the following short form for human. 1000 genomes populations are currently pilot 1 (low coverage).

Example value for --freq_pop   Description
1kg_all                        1000 genomes combined population (global)
1kg_afr                        1000 genomes combined African populations (also amr, asn, eur)
1kg_chb                        1000 genomes CHB population
hapmap_yri                     HapMap YRI population
1kg                            Any 1000 genomes phase 1 population
ceu                            Any of HapMap or 1000 genomes CEU populations
any                            Any HapMap or 1000 genomes population

--freq_freq [freq]
  Minor allele frequency to use for filtering. Must be a float value between 0 and 0.5
--freq_gt_lt [gt|lt]
  Specify whether the frequency of the co-located variant must be greater than (gt) or less than (lt) the value specified with --freq_freq
--freq_filter [exclude|include]
  Specify whether to exclude or include only variants that pass the frequency filter
--filter [filters]
  Filter the output on consequence type. Multiple allowed types can be specified, separated by commas. SO terms should ideally be used, although Ensembl and NCBI types are also allowed. Consequence types can be excluded by adding "no_" to the start of the filter name. Shortcuts to common groupings are available:

Shortcut name   Description
upstream        Any upstream variant
downstream      Any downstream variant
utr             Any UTR variant
splice          Any splicing region variants
coding          Any variant that falls in the coding region of a transcript
coding_change   Any variant that causes a coding change in the transcript
regulatory      Any variant that falls in a regulatory or binding motif feature
To reproduce a filtered version of the input file, add the flag --original
Not used by default
--allow_non_variant
  When using VCF format as input and output, by default the VEP will skip non-variant lines of input (where the ALT allele is null). With this option enabled, these lines are printed in the VCF output with no consequence data added.
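
Since --check_frequency needs all four --freq_* options together, a small wrapper can assemble and sanity-check them in one place. The function name freq_filter_args is hypothetical and the checks simply mirror the value ranges listed above; remember that this kind of filtering requires a database connection.

```shell
# Validate the four settings and print the matching flags.
freq_filter_args() {
  pop=$1; freq=$2; gt_lt=$3; action=$4
  case $gt_lt in
    gt|lt) ;;
    *) echo "--freq_gt_lt must be gt or lt" >&2; return 1 ;;
  esac
  case $action in
    exclude|include) ;;
    *) echo "--freq_filter must be exclude or include" >&2; return 1 ;;
  esac
  # awk does the floating-point range check (0 to 0.5).
  if ! awk -v f="$freq" 'BEGIN { exit !(f ~ /^[0-9]*\.?[0-9]+$/ && f + 0 <= 0.5) }'; then
    echo "--freq_freq must be a number between 0 and 0.5" >&2
    return 1
  fi
  echo "--check_frequency --freq_pop $pop --freq_freq $freq --freq_gt_lt $gt_lt --freq_filter $action"
}

# Exclude variants whose co-located variant has MAF > 0.05 in 1kg_eur.
freq_filter_args 1kg_eur 0.05 gt exclude
```

The printed flags can be appended to a normal command line, for example after --cache -i input.txt -o output.txt.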

Database options

Flag Alternate Description
--database
  Enable the VEP to use local or remote databases.
--host [hostname]
  Manually define the database host to connect to. Users in the US may find connection and transfer speeds quicker using our East coast mirror, useastdb.ensembl.org. Default = "ensembldb.ensembl.org"
--user [username]
-u
Manually define the database username. Default = "anonymous"
--password [password]
--pass
Manually define the database password. Not used by default
--port [number]
  Manually define the database port. Default = 5306
--genomes
  Override the default connection settings with those for the Ensembl Genomes public MySQL server. Required when using any of the Ensembl Genomes species. Not used by default
--gencode_basic
  Limit your analysis to transcripts belonging to the GENCODE basic set. This set has fragmented or problematic transcripts removed.
--refseq
  Instead of using the core database, use the otherfeatures database to retrieve transcripts. This database contains transcript objects corresponding to RefSeq transcripts (to include CCDS and Ensembl ESTs also, use --all_refseq). Consequence output will be given relative to these transcripts in place of the default Ensembl transcripts (see documentation)

You should also specify this option if you have installed the RefSeq cache in order for the VEP to pick up the alternate cache directory.
--merged
  Use the merged Ensembl and RefSeq cache. Consequences are flagged with the SOURCE of each transcript used.
--all_refseq
  When using the RefSeq or merged cache, include e.g. CCDS and Ensembl EST transcripts in addition to those from RefSeq (see documentation). Only works when using --refseq or --merged
--lrg
  Map input variants to LRG coordinates (or to chromosome coordinates if given in LRG coordinates), and provide consequences on both LRG and chromosomal transcripts. Not compatible with --offline
--db_version [number]
--db
Force the script to connect to a specific version of the Ensembl databases. Not recommended as there will usually be conflicts between software and database versions. Not used by default
--registry [filename]
  Defining a registry file overwrites other connection settings and uses those found in the specified registry file to connect. Not used by default

Advanced options

Flag Alternate Description
--no_whole_genome
  Force the script to run in non-whole-genome mode. This was the original default mode for the VEP script, but has now been superseded by whole-genome mode, which is the default. In this mode, variants are analysed one at a time, with no caching of transcript data. Not used by default
--buffer_size [number]
  Sets the internal buffer size, corresponding to the number of variations that are read in to memory simultaneously. Set this lower to use less memory at the expense of longer run time, and higher to use more memory with a faster run time. Default = 5000
--write_cache
  Enable writing to the cache. Not used by default
--build [all|list]
  Build a complete cache for the selected species from the database. Either specify a list of chromosomes (see --chr for how to do this), or use --build all to build for all top-level chromosomes. WARNING: Do not use this flag when connected to one of the public databases - please instead download a pre-built cache or build against a local database. Not used by default
--compress [command]
  By default the VEP uses the utility zcat to decompress cached files. On some systems zcat may not be installed or may misbehave; by specifying one of --compress gzcat or --compress "gzip -dc" you may be able to bypass these problems. Not used by default
--skip_db_check
  ADVANCED: Force the script to use a cache built from a different host than specified with --host. Only use this if you are sure the two hosts are compatible (e.g. ensembldb.ensembl.org can be considered compatible with useastdb.ensembl.org as the data is mirrored between the two). Not used by default
--cache_region_size [size]
  ADVANCED: The size in base-pairs of the region covered by one file in the cache. By default this is 1MB, which produces approximately 500 files at most per sub-directory in human. Reducing this can reduce memory usage and run time when you use a cache built this way. Note that you must specify the same --cache_region_size when both building/writing to the cache and reading from it. Not used by default
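
Whether --compress is needed on a given system can be checked before running the VEP. The probe below is a sketch, not part of the VEP: it gzips a tiny file and tries the three decompression commands named under --compress in turn. The function name pick_decompressor and the probe file name are illustrative.

```shell
# Print the first command that correctly decompresses a small .gz file.
pick_decompressor() {
  probe="${TMPDIR:-/tmp}/vep_decompress_probe.$$.gz"
  printf 'ok\n' | gzip -c > "$probe" || return 1
  for cand in zcat gzcat "gzip -dc"; do
    # $cand is left unquoted on purpose so "gzip -dc" splits into two words.
    if [ "$($cand "$probe" 2>/dev/null)" = "ok" ]; then
      rm -f "$probe"
      echo "$cand"
      return 0
    fi
  done
  rm -f "$probe"
  return 1
}

decomp=$(pick_decompressor) || { echo "no working decompressor found" >&2; exit 1; }
if [ "$decomp" = "zcat" ]; then
  echo "zcat works; --compress is not needed"
else
  echo "add to the command line: --compress \"$decomp\""
fi
```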

Performance

In optimal conditions, the VEP script is capable of processing >2 million variants in 1 hour. Run time is dependent on various factors, and is especially affected by the chromosomal distribution of variants. Variants in exonic regions naturally take longer to process than those in intronic or intergenic regions. Due to the way transcript data is cached in memory, the VEP will, for example, process a file containing 100 variants that fall in one gene faster than it would a file containing 100 variants in 100 different genes.

The VEP is also optimised to run on input files that are sorted in chromosomal order. Unsorted files will still work, albeit more slowly.

For very large files (for example those from whole-genome sequencing), the VEP process can be easily parallelised by dividing your file into chunks (e.g. by chromosome). The VEP will also work with tabix-indexed, bgzipped VCF files, and so the tabix utility could be used to divide the input file:

 tabix -h variants.vcf.gz 12:1000000-20000000 | perl variant_effect_predictor.pl --cache --vcf 
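
Dividing the work by chromosome can be sketched as a generated job list, with one tabix and VEP command per chunk. The chromosome names, input file name and output file names below are illustrative; the commands are written to a file rather than executed.

```shell
# Write one "tabix | VEP" command per chromosome to a job list.
: > vep_jobs.txt
for chr in 1 2 3 X; do
  echo "tabix -h variants.vcf.gz $chr | perl variant_effect_predictor.pl --cache --vcf -o chr$chr.vcf" >> vep_jobs.txt
done

cat vep_jobs.txt
```

Each line of vep_jobs.txt is an independent command, so the lines can be run one after another with sh vep_jobs.txt or handed to whatever job scheduler is available.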

Forking

Using forking enables the VEP to run multiple parallel "threads", with each thread processing a subset of your input. Most modern computers have more than one processor core, so running the VEP with forking enabled can give huge speed increases (3-4x faster in most cases). Even computers with a single core will see speed benefits due to overheads associated with using object-oriented code in Perl.

To use forking, you must choose a number of forks to use with the --fork flag. Most users should use 4 forks:

perl variant_effect_predictor.pl -i my_input.vcf --fork 4 --offline

but depending on various factors specific to your setup you may see faster performance with fewer or more forks.

NB: VEP users writing plugins should be aware that while the VEP code attempts to preserve the state of any plugin-specific cached data between separate forks, there may be situations where data is lost. If you find this is the case, you should disable forking in the new() method of your plugin by deleting the "fork" key from the $config hash.