Variant Effect Predictor data formats
Both the web and script versions of the VEP can use the same input formats. Formats can be auto-detected by the VEP script, but must be manually selected when using the web interface. The VEP can use VCF, pileup and HGVS notations in addition to the default format.
The default format is a simple whitespace-separated format (columns may be separated by space or tab characters), containing five required columns plus an optional identifier column:
- chromosome - just the name or number, with no 'chr' prefix
- start - start coordinate of the variant
- end - end coordinate of the variant
- allele - pair of alleles separated by a '/', with the reference allele first
- strand - defined as + (forward) or - (reverse).
- identifier - this identifier will be used in the VEP's output. If not provided, the VEP will construct an identifier from the given coordinates and alleles.
1 881907 881906 -/C +
5 140532 140532 T/C +
12 1017956 1017956 T/A +
2 946507 946507 G/C +
14 19584687 19584687 C/T -
19 66520 66520 G/A + var1
8 150029 150029 A/T + var2
An insertion (of any size) is indicated by start coordinate = end coordinate + 1. For example, an insertion of 'C' between nucleotides 12600 and 12601 on the forward strand of chromosome 8 is indicated as follows:
8 12601 12600 -/C +
A deletion is indicated by the exact nucleotide coordinates. For example, a three base pair deletion of nucleotides 12600, 12601, and 12602 of the reverse strand of chromosome 8 will be:
8 12600 12602 CGT/- -
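The coordinate conventions above can be sketched in Python. This is an illustrative reader for the default format, not the VEP's own parser; the names DefaultVariant, parse_default_line and classify are hypothetical, and the sketch assumes the five required columns are always present (so it does not handle the four-column structural variant lines).

```python
from typing import NamedTuple, Optional


class DefaultVariant(NamedTuple):
    chromosome: str
    start: int
    end: int
    allele: str
    strand: str
    identifier: Optional[str]


def parse_default_line(line: str) -> DefaultVariant:
    """Parse one whitespace-separated line of the default input format."""
    fields = line.split()
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 columns, got {len(fields)}")
    chromosome, start, end, allele, strand = fields[:5]
    # The sixth column (identifier) is optional.
    identifier = fields[5] if len(fields) == 6 else None
    return DefaultVariant(chromosome, int(start), int(end), allele, strand, identifier)


def classify(variant: DefaultVariant) -> str:
    """Classify a variant from its coordinates and allele string."""
    ref, _, alt = variant.allele.partition("/")
    # Insertion: start = end + 1 and no reference sequence.
    if variant.start == variant.end + 1 and ref == "-":
        return "insertion"
    # Deletion: exact coordinates of the deleted bases, no variant sequence.
    if alt == "-":
        return "deletion"
    return "substitution"
```

For instance, classify(parse_default_line("8 12601 12600 -/C +")) identifies the insertion example given above from its coordinates alone.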
The VEP also supports using VCF (Variant Call Format) version 4.0. This is a common format used by the 1000 Genomes Project, and can be produced as an output format by many variant calling tools.
VCF users should note a difference between how Ensembl and VCF describe unbalanced variations. For any unbalanced variant (i.e. insertion, deletion or unbalanced substitution), the VCF specification requires that the base immediately before the variant be included in both the reference and variant alleles. This also affects the reported position, i.e. the reported position will be one base before the actual site of the variant.
In order to parse this correctly, the VEP needs to convert such variants into Ensembl-type coordinates, and it does this by removing the additional base and adjusting the coordinates accordingly. This means that if an identifier is not supplied for a variant (in the 3rd column of the VCF), then the identifier constructed and the position reported in the VEP's output file will differ from the input.
This problem can be overcome by either:
- ensuring each variant has a unique identifier specified in the 3rd column of the VCF
- using VCF format as output (--vcf) - this preserves the formatting of your input coordinates and alleles
The following examples illustrate how VCF describes a variant and how it is handled internally by the VEP. Consider the following aligned sequences (for the purposes of discussion on chromosome 20):
Ref: a t C g a // C is the reference base
1  : a t G g a // C base is a G in individual 1
2  : a t - g a // C base is deleted w.r.t. the reference in individual 2
3  : a t CAg a // A base is inserted w.r.t. the reference sequence in individual 3
The first individual shows a simple balanced substitution of G for C at base 3. This is described in a compatible manner in VCF and Ensembl styles. Firstly, in VCF:
20 3 . C G . PASS .
And in Ensembl format:
20 3 3 C/G +
The second individual has the 3rd base deleted relative to the reference. In VCF, both the reference and variant allele columns must include the preceding base (T) and the reported position is that of the preceding base:
20 2 . TC T . PASS .
In Ensembl format, the preceding base is not included, and the start/end coordinates represent the region of the sequence deleted. A "-" character is used to indicate that the base is deleted in the variant sequence:
20 3 3 C/- +
The upshot of this is that while in the VCF input file the position of the variant is reported as 2, in the output file from the VEP the position will be reported as 3. If no identifier is provided in the third column of the VCF, then the constructed identifier will be built from the converted position and alleles rather than from those in the input.
The third individual has an "A" inserted between the 3rd and 4th bases of the sequence relative to the reference. In VCF, as for the deletion, the base before the insertion is included in both the reference and variant allele columns, and the reported position is that of the preceding base:
20 3 . C CA . PASS .
In Ensembl format, again the preceding base is not included, and the start/end positions are "swapped" to indicate that this is an insertion. Similarly to a deletion, a "-" is used to indicate no sequence in the reference:
20 4 3 -/A +
Again, the output will appear different, and the constructed identifier, which is built from the converted coordinates and alleles, may not be what is expected.
The solution is to always add a unique identifier for each of your variants to the VCF file, or use VCF as your output format.
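The conversion described in the three examples above can be sketched in Python. This is a minimal illustration rather than the VEP's actual implementation; the function name vcf_to_ensembl is hypothetical, and the sketch assumes a single alternate allele that shares its first base with the reference whenever the variant is unbalanced.

```python
from typing import Tuple


def vcf_to_ensembl(chrom: str, pos: int, ref: str, alt: str) -> Tuple[str, int, int, str]:
    """Convert a VCF position and REF/ALT pair to (chromosome, start, end, alleles)."""
    # Unbalanced variants carry the preceding base in both alleles:
    # remove it and move the position forward by one.
    if len(ref) != len(alt) and ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    start = pos
    # An empty reference allele gives end = start - 1, i.e. an insertion.
    end = pos + len(ref) - 1
    # "-" stands for no sequence on either side of the allele string.
    return chrom, start, end, f"{ref or '-'}/{alt or '-'}"


# The three individuals from the aligned sequences above.
print(vcf_to_ensembl("20", 3, "C", "G"))   # substitution
print(vcf_to_ensembl("20", 2, "TC", "T"))  # deletion
print(vcf_to_ensembl("20", 3, "C", "CA"))  # insertion
```

The returned tuples correspond to the first four columns of the Ensembl default format; the strand is not derived here because VCF alleles are always given on the forward strand.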
The VEP can also call consequences on structural variants encoded in tab-delimited or VCF format. To recognise a variant as a structural variant, the allele string (or "SVTYPE" INFO field in VCF) must be set to one of the currently recognised values:
- INS - insertion
- DEL - deletion
- DUP - duplication
- TDUP - tandem duplication
Examples of structural variants encoded in tab-delimited format:
1 160283 471362 DUP
1 1385015 1387562 DEL
Examples of structural variants encoded in VCF format:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 160283 sv1 . <DUP> . . SVTYPE=DUP;END=471362 .
1 1385015 sv2 . <DEL> . . SVTYPE=DEL;END=1387562 .
See the VCF definition document for more detail on how to describe structural variants in VCF format.
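As an illustration of how the "SVTYPE" and "END" INFO keys identify a structural variant, the following Python sketch reads one VCF data line. The function name parse_sv_vcf_line and the returned dictionary keys are hypothetical; the set of recognised types is taken from the list above, and the positions are returned exactly as written in the file, with no coordinate conversion attempted.

```python
RECOGNISED_SV_TYPES = {"INS", "DEL", "DUP", "TDUP"}


def parse_sv_vcf_line(line: str) -> dict:
    """Extract the structural variant type and extent from one VCF data line."""
    fields = line.split()
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    # INFO is the 8th column: key=value pairs separated by ';'.
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    sv_type = info.get("SVTYPE")
    if sv_type not in RECOGNISED_SV_TYPES:
        raise ValueError(f"unrecognised or missing SVTYPE: {sv_type!r}")
    return {
        "chromosome": fields[0],
        "pos": int(fields[1]),
        "end": int(info["END"]),
        "id": fields[2],
        "type": sv_type,
    }


print(parse_sv_vcf_line("1 160283 sv1 . <DUP> . . SVTYPE=DUP;END=471362 ."))
```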
The pileup format can also be used as input for the VEP. This is the output of the ssaha pileup package.
HGVS notations can also be used as input; see http://www.hgvs.org/mutnomen/ for details of the nomenclature. These must be relative to genomic or Ensembl transcript coordinates. It is also possible to use RefSeq transcripts in both the web interface and the VEP script (see script documentation). This works for RefSeq transcripts that align to the genome correctly.
ENST00000207771.3:c.344+626A>T
ENST00000471631.1:c.28_33delTCGCGG
ENST00000285667.3:c.1047_1048insC
5:g.140532T>C
Examples using RefSeq identifiers (using --refseq in the VEP script, or select the otherfeatures transcript database on the web interface and input type of HGVS):
NM_153681.2:c.7C>T
NM_005239.4:c.190G>A
NM_001025204.1:c.336G>A
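The general shape of these notations (a reference sequence, a colon, a one-letter coordinate type, a dot, then the change) can be illustrated with a small Python sketch. It only splits the string into its parts and does not validate or interpret the change itself; the function name split_hgvs is hypothetical, and only the genomic (g), coding (c) and protein (p) types are handled.

```python
from typing import Tuple

COORDINATE_TYPES = {"g": "genomic", "c": "coding", "p": "protein"}


def split_hgvs(notation: str) -> Tuple[str, str, str]:
    """Split an HGVS string into (reference, coordinate type, change)."""
    reference, separator, description = notation.partition(":")
    # The description must look like "<letter>.<change>".
    if not separator or len(description) < 3 or description[1] != ".":
        raise ValueError(f"not an HGVS notation: {notation!r}")
    kind = COORDINATE_TYPES.get(description[0])
    if kind is None:
        raise ValueError(f"unsupported coordinate type: {description[0]!r}")
    return reference, kind, description[2:]


print(split_hgvs("ENST00000285667.3:c.1047_1048insC"))
print(split_hgvs("5:g.140532T>C"))
```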
HGVS protein notations may also be used, provided that they unambiguously map to a single genomic change. Due to redundancy in the amino acid code, it is not always possible to work out the corresponding genomic sequence change for a given protein sequence change. The following example is for a permissible protein notation in dog (Canis familiaris):
Variant identifiers can also be used as input. These should be e.g. dbSNP rsIDs, or any synonym for a variant present in the Ensembl Variation database. A list of identifier sources is given in the Ensembl documentation.
The output format from the web and script VEP is the same. The output columns are:
- Uploaded variation - as chromosome_start_alleles
- Location - in standard coordinate format (chr:start or chr:start-end)
- Allele - the variant allele used to calculate the consequence
- Gene - Ensembl stable ID of affected gene
- Feature - Ensembl stable ID of feature
- Feature type - type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature.
- Consequence - consequence type of this variation
- Position in cDNA - relative position of base pair in cDNA sequence
- Position in CDS - relative position of base pair in coding sequence
- Position in protein - relative position of amino acid in protein
- Amino acid change - only given if the variation affects the protein-coding sequence
- Codon change - the alternative codons with the variant base in upper case
- Co-located variation - known identifier of existing variation
- Extra - this column contains extra information as key=value pairs separated by ";". The keys are as follows:
- SYMBOL - the gene symbol
- SYMBOL_SOURCE - the source of the gene symbol
- STRAND - the DNA strand (1 or -1) on which the transcript/feature lies
- ENSP - the Ensembl protein identifier of the affected transcript
- HGVSc - the HGVS coding sequence name
- HGVSp - the HGVS protein sequence name
- SIFT - the SIFT prediction and/or score, with both given as prediction(score)
- PolyPhen - the PolyPhen prediction and/or score
- MOTIF_NAME - the source and identifier of a transcription factor binding profile aligned at this position
- MOTIF_POS - The relative position of the variation in the aligned TFBP
- HIGH_INF_POS - a flag indicating if the variant falls in a high information position of a transcription factor binding profile (TFBP)
- MOTIF_SCORE_CHANGE - The difference in motif score of the reference and variant sequences for the TFBP
- CELL_TYPE - List of cell types and classifications for regulatory feature
- CANONICAL - a flag indicating if the transcript is denoted as the canonical transcript for this gene
- CCDS - the CCDS identifier for this transcript, where applicable
- INTRON - the intron number (out of total number)
- EXON - the exon number (out of total number)
- DOMAINS - the source and identifier of any overlapping protein domains
- DISTANCE - Shortest distance from variant to transcript
- IND - individual name
- ZYG - zygosity of individual genotype at this locus
- SV - IDs of overlapping structural variants
- FREQS - Frequencies of overlapping variants used in filtering
- GMAF - Minor allele and frequency of existing variation in 1000 Genomes Phase 1
- AFR_MAF - Minor allele and frequency of existing variation in 1000 Genomes Phase 1 combined African population
- AMR_MAF - Minor allele and frequency of existing variation in 1000 Genomes Phase 1 combined American population
- ASN_MAF - Minor allele and frequency of existing variation in 1000 Genomes Phase 1 combined Asian population
- EUR_MAF - Minor allele and frequency of existing variation in 1000 Genomes Phase 1 combined European population
- AA_MAF - Minor allele and frequency of existing variant in NHLBI-ESP African American population
- EA_MAF - Minor allele and frequency of existing variant in NHLBI-ESP European American population
- CLIN_SIG - Clinical significance of variant from dbSNP
- BIOTYPE - Biotype of transcript
- PUBMED - Pubmed ID(s) of publications that cite existing variant
- ALLELE_NUM - Allele number from input; 0 is reference, 1 is first alternate etc
Empty values are denoted by '-'. Further fields in the Extra column can be added by plugins or using custom annotations in the VEP script. Output fields can be configured using the --fields flag when running the VEP script.
11_224088_C/A 11:224088 A ENSG00000142082 ENST00000525319 Transcript missense_variant 742 716 239 T/N aCc/aAc - SIFT=deleterious(0);PolyPhen=unknown(0)
11_224088_C/A 11:224088 A ENSG00000142082 ENST00000534381 Transcript 5_prime_UTR_variant - - - - - - -
11_224088_C/A 11:224088 A ENSG00000142082 ENST00000529055 Transcript downstream_variant - - - - - - -
11_224585_G/A 11:224585 A ENSG00000142082 ENST00000529937 Transcript intron_variant - - - - - - HGVSc=ENST00000529937.1:c.136-346G>A
22_16084370_G/A 22:16084370 A - ENSR00000615113 RegulatoryFeature regulatory_region_variant - - - - - - -
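The Extra column shown in the example output can be unpacked with a few lines of Python. This is a sketch under the conventions stated above (key=value pairs separated by ";", and "-" for an empty value); the function names parse_extra and split_prediction are hypothetical and are not part of the VEP.

```python
import re
from typing import Dict, Optional, Tuple


def parse_extra(extra: str) -> Dict[str, str]:
    """Turn the Extra column into a dictionary of key/value pairs."""
    if extra == "-":  # empty values are denoted by '-'
        return {}
    result = {}
    for pair in extra.split(";"):
        # Split on the first '=' only, since values may contain other characters.
        key, _, value = pair.partition("=")
        result[key] = value
    return result


def split_prediction(value: str) -> Tuple[str, Optional[float]]:
    """Split a combined 'prediction(score)' value such as a SIFT entry."""
    match = re.fullmatch(r"(\w+)\(([0-9.]+)\)", value)
    if match is None:
        return value, None  # not in the combined form; return it unchanged
    return match.group(1), float(match.group(2))


extra = parse_extra("SIFT=deleterious(0);PolyPhen=unknown(0)")
print(extra)
print(split_prediction(extra["SIFT"]))
```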
The VEP script will also add a header to the output file. This contains information about the databases connected to, and also a key describing the key/value pairs used in the extra column.
## ENSEMBL VARIANT EFFECT PREDICTOR v75
## Output produced at 2013-06-16 16:09:38
## Connected to homo_sapiens_core_75_37 on ensembldb.ensembl.org
## Using API version 75, DB version 75
## Extra column keys:
## DISTANCE : Shortest distance from variant to transcript
The VEP script can also generate VCF output using the --vcf flag. Consequences are added in the INFO field of the VCF file, using the key "CSQ". Data fields are separated by "|"; the order of fields is written in the VCF header. Output fields can be configured by using --fields. Unpopulated fields are represented by an empty string.
VCFs produced by the VEP can be filtered by filter_vep.pl in the same way as standard format output files.
If the input format was VCF, the file will remain unchanged save for the addition of the CSQ field and the header (unless using any filtering). If an existing CSQ field is found, it will be replaced by the one added by the VEP.
Custom data added with --custom are added as separate fields, using the key specified for each data file.
Commas in fields are replaced with ampersands (&) to preserve VCF format.
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|SYMBOL|SIFT">
#CHROM POS ID REF ALT QUAL FILTER INFO
21 26960070 rs116645811 G A . . CSQ=A|ENSG00000260583|ENST00000567517|Transcript|upstream_gene_variant|||||||4432|LINC00515|,A|ENSG00000154719|ENST00000352957|Transcript|intron_variant||||||||MRPL39|,A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|||MRPL39|tolerated(0.06)
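A CSQ entry like the one in this example can be decoded by reading the field order from the header and splitting each comma-separated block on "|". The Python below is a sketch of that idea only; the function names csq_field_names and parse_csq are hypothetical, and the sketch assumes the header description has the "Format: ..." layout shown in the example.

```python
import re
from typing import Dict, List


def csq_field_names(header_line: str) -> List[str]:
    """Read the CSQ field order from the ##INFO header line."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format:' description found in header line")
    return match.group(1).split("|")


def parse_csq(info: str, field_names: List[str]) -> List[Dict[str, str]]:
    """Return one dictionary per consequence block in the INFO column."""
    for item in info.split(";"):
        if item.startswith("CSQ="):
            # Blocks are comma-separated; fields within a block use '|'.
            blocks = item[len("CSQ="):].split(",")
            return [dict(zip(field_names, block.split("|"))) for block in blocks]
    return []  # no CSQ key present


header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as '
          'predicted by VEP. Format: Allele|Gene|Feature|Consequence|SYMBOL">')
info = "CSQ=A|ENSG00000154719|ENST00000352957|intron_variant|MRPL39"
for row in parse_csq(info, csq_field_names(header)):
    print(row["Feature"], row["Consequence"], row["SYMBOL"])
```

Unpopulated fields come back as empty strings, and any ampersands inside a value stand for commas that were replaced to preserve the VCF format.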
The VEP writes an HTML file containing statistics pertaining to the results of your job; it is named [output_file]_summary.html (with the default options the file will be named variant_effect_output.txt_summary.html). To view it you should open the file in your web browser.
To prevent the VEP writing a stats file, use the flag --no_stats. To have the VEP write a machine-readable text file in place of the HTML, use --stats_text. To change the name of the stats file from the default, use --stats_file [file].
The page contains several sections:
The first section contains two tables. The first describes the cache and/or database used, the version of the VEP, species, command line parameters, input/output files and run time. The second table contains information about the number of variants, and the number of genes, transcripts and regulatory features overlapped by the input.
Charts and tables
There then follow several charts, most with accompanying tables. Tables and charts are interactive; clicking on a row to highlight it in the table will highlight the relevant segment in the chart, and vice versa.