Variant Effect Predictor data formats

Input

Both the web and script versions of the VEP accept the same input formats. The VEP script can auto-detect the format, but the format must be selected manually in the web interface. In addition to the default format, the VEP accepts VCF, pileup and HGVS notations.

Default

The default format is a simple whitespace-separated format (columns may be separated by space or tab characters), containing five required columns plus an optional identifier column:

  1. chromosome - just the name or number, with no 'chr' prefix
  2. start
  3. end
  4. allele - pair of alleles separated by a '/', with the reference allele first
  5. strand - defined as + (forward) or - (reverse).
  6. identifier - this identifier will be used in the VEP's output. If not provided, the VEP will construct an identifier from the given coordinates and alleles.
1   881907    881906    -/C   +
5   140532    140532    T/C   +
12  1017956   1017956   T/A   +
2   946507    946507    G/C   +
14  19584687  19584687  C/T   -
19  66520     66520     G/A   +    var1
8   150029    150029    A/T   +    var2

An insertion (of any size) is indicated by start coordinate = end coordinate + 1. For example, an insertion of 'C' between nucleotides 12600 and 12601 on the forward strand of chromosome 8 is indicated as follows:

8   12601     12600     -/C   +

A deletion is indicated by the exact nucleotide coordinates. For example, a three base pair deletion of nucleotides 12600, 12601, and 12602 of the reverse strand of chromosome 8 will be:

8   12600     12602     CGT/- -
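The coordinate conventions above can be checked mechanically. The following is a minimal sketch in Python, not part of the VEP itself; the names DefaultRecord, parse_default_line and classify are hypothetical, and the sketch assumes exactly five or six whitespace-separated columns per line.

```python
from typing import NamedTuple, Optional


class DefaultRecord(NamedTuple):
    """One line of the default input format (hypothetical helper type)."""
    chrom: str
    start: int
    end: int
    alleles: str
    strand: str
    identifier: Optional[str]


def parse_default_line(line: str) -> DefaultRecord:
    """Split one whitespace-separated line into its five or six columns."""
    fields = line.split()
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 columns, got {len(fields)}")
    chrom, start, end, alleles, strand = fields[:5]
    identifier = fields[5] if len(fields) == 6 else None
    return DefaultRecord(chrom, int(start), int(end), alleles, strand, identifier)


def classify(record: DefaultRecord) -> str:
    """Label a record using the start/end and allele conventions described above."""
    ref, _, alt = record.alleles.partition("/")
    if ref == "-" and record.start == record.end + 1:
        return "insertion"
    if alt == "-":
        return "deletion"
    return "substitution"
```

Applied to the two example lines above, classify labels the first as an insertion and the second as a deletion. The sketch does not handle the four-column structural variant lines described later in this document.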

VCF

The VEP also supports VCF (Variant Call Format) version 4.0. This is a common format used by the 1000 Genomes Project, and many variant calling tools can produce it as output.

VCF users should note that Ensembl and VCF describe unbalanced variants differently. For any unbalanced variant (i.e. an insertion, deletion or unbalanced substitution), the VCF specification requires that the base immediately before the variant be included in both the reference and variant alleles. This also affects the reported position, which will be one base before the actual site of the variant.

In order to parse this correctly, the VEP needs to convert such variants into Ensembl-type coordinates, and it does this by removing the additional base and adjusting the coordinates accordingly. This means that if an identifier is not supplied for a variant (in the 3rd column of the VCF), then the identifier constructed and the position reported in the VEP's output file will differ from the input.

This problem can be overcome by either:

  1. ensuring each variant has a unique identifier specified in the 3rd column of the VCF
  2. using VCF format as output (--vcf) - this preserves the formatting of your input coordinates and alleles

The following examples illustrate how VCF describes a variant and how it is handled internally by the VEP. Consider the following aligned sequences (for the purposes of discussion on chromosome 20):

Ref: a t C g a // C is the reference base
1 : a t G g a // C base is a G in individual 1
2 : a t - g a // C base is deleted w.r.t. the reference in individual 2
3 : a t CAg a // A base is inserted w.r.t. the reference sequence in individual 3

Individual 1

The first individual shows a simple balanced substitution of G for C at base 3. This is described in a compatible manner in VCF and Ensembl styles. Firstly, in VCF:

20   3   .   C   G   .   PASS   .

And in Ensembl format:

 20   3   3   C/G   +

Individual 2

The second individual has the 3rd base deleted relative to the reference. In VCF, both the reference and variant allele columns must include the preceding base (T) and the reported position is that of the preceding base:

20   2   .   TC   T   .   PASS   .

In Ensembl format, the preceding base is not included, and the start/end coordinates represent the region of the sequence deleted. A "-" character is used to indicate that the base is deleted in the variant sequence:

20   3   3   C/-   +

The upshot of this is that while in the VCF input file the position of the variant is reported as 2, in the output file from the VEP the position will be reported as 3. If no identifier is provided in the third column of the VCF, then the constructed identifier will be:

20_3_C/-

Individual 3

The third individual has an "A" inserted between the 3rd and 4th bases of the sequence relative to the reference. In VCF, as for the deletion, the base before the insertion is included in both the reference and variant allele columns, and the reported position is that of the preceding base:

20   3   .   C   CA   .   PASS   .

In Ensembl format, again the preceding base is not included, and the start/end positions are "swapped" to indicate that this is an insertion. Similarly to a deletion, a "-" is used to indicate no sequence in the reference:

 20   4   3   -/A   +

Again, the output will appear different, and the constructed identifier may not be what is expected:

20_3_-/A

The solution is to always add a unique identifier for each of your variants to the VCF file, or use VCF as your output format.
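The conversion described for the three individuals can be sketched in a few lines of Python. This is an illustration of the rule as stated in this section, not the VEP's own code; the function name vcf_to_ensembl is hypothetical, and the sketch assumes a single alternate allele on the forward strand.

```python
from typing import Tuple


def vcf_to_ensembl(chrom: str, pos: int, ref: str, alt: str) -> Tuple[str, int, int, str]:
    """Return (chromosome, start, end, allele string) in Ensembl-style coordinates.

    Assumption: one ALT allele only; multi-allelic records are not handled.
    """
    if len(ref) != len(alt) and ref[:1] == alt[:1]:
        # Unbalanced variant: drop the shared preceding base and shift the position.
        ref, alt = ref[1:], alt[1:]
        pos += 1
    start = pos
    # For an insertion the reference allele is now empty, so end = start - 1.
    end = pos + len(ref) - 1
    return chrom, start, end, f"{ref or '-'}/{alt or '-'}"
```

With the three VCF lines above as input, this returns ("20", 3, 3, "C/G"), ("20", 3, 3, "C/-") and ("20", 4, 3, "-/A"), matching the Ensembl-format lines shown for each individual.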

Structural variants

The VEP can also call consequences on structural variants encoded in tab-delimited or VCF format. To recognise a variant as a structural variant, the allele string (or "SVTYPE" INFO field in VCF) must be set to one of the currently recognised values:

  • INS - insertion
  • DEL - deletion
  • DUP - duplication
  • TDUP - tandem duplication

Examples of structural variants encoded in tab-delimited format:

1    160283    471362    DUP
1    1385015   1387562   DEL

Examples of structural variants encoded in VCF format:

#CHROM  POS     ID   REF  ALT    QUAL  FILTER  INFO                    FORMAT
1       160283  sv1  .    <DUP>  .     .       SVTYPE=DUP;END=471362   .
1       1385015 sv2  .    <DEL>  .     .       SVTYPE=DEL;END=1387562  .

See the VCF definition document for more detail on how to describe structural variants in VCF format.
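As an illustration of how the SVTYPE and END keys relate to the tab-delimited form, here is a minimal Python sketch. The function names are hypothetical, and it assumes that the VCF POS value is used unchanged as the start coordinate, as the paired examples above suggest.

```python
from typing import Dict, Optional, Tuple, Union

# The structural variant types listed above.
RECOGNISED_SV_TYPES = {"INS", "DEL", "DUP", "TDUP"}


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Split a VCF INFO column into a dict; keys without a value become True."""
    result: Dict[str, Union[str, bool]] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def structural_variant_from_vcf(line: str) -> Optional[Tuple[str, int, int, str]]:
    """Return (chromosome, start, end, type), or None if not a recognised SV."""
    fields = line.split()
    info = parse_info(fields[7])
    sv_type = info.get("SVTYPE")
    if sv_type not in RECOGNISED_SV_TYPES or "END" not in info:
        return None
    return fields[0], int(fields[1]), int(info["END"]), str(sv_type)
```

For the sv1 line above this yields ("1", 160283, 471362, "DUP"), the same values as the first tab-delimited example.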

Pileup

The pileup format can also be used as input for the VEP. This is the output of the ssaha pileup package.

HGVS identifiers

See http://www.hgvs.org/mutnomen/ for details. These must be relative to genomic or Ensembl transcript coordinates. It is also possible to use RefSeq transcripts in both the web interface and the VEP script (see script documentation). This works for RefSeq transcripts that align to the genome correctly.

Examples:

ENST00000207771.3:c.344+626A>T
ENST00000471631.1:c.28_33delTCGCGG
ENST00000285667.3:c.1047_1048insC
5:g.140532T>C

Examples using RefSeq identifiers (using --refseq in the VEP script, or select the otherfeatures transcript database on the web interface and input type of HGVS):

NM_153681.2:c.7C>T
NM_005239.4:c.190G>A
NM_001025204.1:c.336G>A

HGVS protein notations may also be used, provided that they unambiguously map to a single genomic change. Due to redundancy in the amino acid code, it is not always possible to work out the corresponding genomic sequence change for a given protein sequence change. The following example is a permissible protein notation in dog (Canis familiaris):

ENSCAFP00000040171.1:p.Thr92Asn
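The examples above share a common shape: a reference sequence, a colon, a one-letter type prefix and the description of the change. A minimal Python sketch for separating these parts follows. The function name split_hgvs is hypothetical, and only the three prefixes that appear in this document (g, c and p) are covered; the sketch does not validate the change description itself.

```python
from typing import Tuple

# Only the notation types used in the examples above.
HGVS_KINDS = {"g": "genomic", "c": "coding", "p": "protein"}


def split_hgvs(notation: str) -> Tuple[str, str, str]:
    """Return (reference sequence, notation kind, change description)."""
    reference, sep, description = notation.partition(":")
    if not sep or len(description) < 3 or description[1] != ".":
        raise ValueError(f"not in 'reference:type.change' form: {notation!r}")
    kind = description[0]
    if kind not in HGVS_KINDS:
        raise ValueError(f"unsupported notation type: {kind!r}")
    return reference, HGVS_KINDS[kind], description[2:]
```

For instance, split_hgvs("5:g.140532T>C") gives ("5", "genomic", "140532T>C").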

Variant identifiers

These should be e.g. dbSNP rsIDs, or any synonym for a variant present in the Ensembl Variation database. See here for a list of identifier sources in Ensembl.


Output

The output format from the web and script VEP is the same. The output columns are:

  1. Uploaded variation - as chromosome_start_alleles
  2. Location - in standard coordinate format (chr:start or chr:start-end)
  3. Allele - the variant allele used to calculate the consequence
  4. Gene - Ensembl stable ID of affected gene
  5. Feature - Ensembl stable ID of feature
  6. Feature type - type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature.
  7. Consequence - consequence type of this variation
  8. Position in cDNA - relative position of base pair in cDNA sequence
  9. Position in CDS - relative position of base pair in coding sequence
  10. Position in protein - relative position of amino acid in protein
  11. Amino acid change - only given if the variation affects the protein-coding sequence
  12. Codon change - the alternative codons with the variant base in upper case
  13. Co-located variation - known identifier of existing variation
  14. Extra - this column contains extra information as key=value pairs separated by ";". The keys are as follows:
    • SYMBOL - the gene symbol
    • SYMBOL_SOURCE - the source of the gene symbol
    • STRAND - the DNA strand (1 or -1) on which the transcript/feature lies
    • ENSP - the Ensembl protein identifier of the affected transcript
    • SWISSPROT - UniProtKB/Swiss-Prot identifier of protein product
    • TREMBL - UniProtKB/TrEMBL identifier of protein product
    • UNIPARC - UniParc identifier of protein product
    • HGVSc - the HGVS coding sequence name
    • HGVSp - the HGVS protein sequence name
    • SIFT - the SIFT prediction and/or score, with both given as prediction(score)
    • PolyPhen - the PolyPhen prediction and/or score
    • MOTIF_NAME - the source and identifier of a transcription factor binding profile aligned at this position
    • MOTIF_POS - The relative position of the variation in the aligned TFBP
    • HIGH_INF_POS - a flag indicating if the variant falls in a high information position of a transcription factor binding profile (TFBP)
    • MOTIF_SCORE_CHANGE - The difference in motif score of the reference and variant sequences for the TFBP
    • CELL_TYPE - List of cell types and classifications for regulatory feature
    • CANONICAL - a flag indicating if the transcript is denoted as the canonical transcript for this gene
    • CCDS - the CCDS identifier for this transcript, where applicable
    • INTRON - the intron number (out of total number)
    • EXON - the exon number (out of total number)
    • DOMAINS - the source and identifier of any overlapping protein domains
    • DISTANCE - Shortest distance from variant to transcript
    • IND - individual name
    • ZYG - zygosity of individual genotype at this locus
    • SV - IDs of overlapping structural variants
    • FREQS - Frequencies of overlapping variants used in filtering
    • GMAF - Minor allele and frequency of existing variation in 1000 Genomes Phase 1
    • AFR_MAF - Minor allele and frequency of existing variation in 1000 Genomes Phase 1 combined African population
    • AMR_MAF - Minor allele and frequency of existing variation in 1000 Genomes Phase 1 combined American population
    • ASN_MAF - Minor allele and frequency of existing variation in 1000 Genomes Phase 1 combined Asian population
    • EUR_MAF - Minor allele and frequency of existing variation in 1000 Genomes Phase 1 combined European population
    • AA_MAF - Minor allele and frequency of existing variant in NHLBI-ESP African American population
    • EA_MAF - Minor allele and frequency of existing variant in NHLBI-ESP European American population
    • CLIN_SIG - Clinical significance of variant from dbSNP
    • BIOTYPE - Biotype of transcript
    • PUBMED - Pubmed ID(s) of publications that cite existing variant
    • SOMATIC - Somatic status of existing variation(s)
    • ALLELE_NUM - Allele number from input; 0 is reference, 1 is first alternate etc

Empty values are denoted by '-'. Further fields in the Extra column can be added by plugins or using custom annotations in the VEP script. Output fields can be configured using the --fields flag when running the VEP script.

11_224088_C/A    11:224088   A  ENSG00000142082  ENST00000525319  Transcript         missense_variant           742  716  239  T/N  aCc/aAc  -  SIFT=deleterious(0);PolyPhen=unknown(0)
11_224088_C/A    11:224088   A  ENSG00000142082  ENST00000534381  Transcript         5_prime_UTR_variant        -    -    -    -    -        -  -
11_224088_C/A    11:224088   A  ENSG00000142082  ENST00000529055  Transcript         downstream_variant         -    -    -    -    -        -  -
11_224585_G/A    11:224585   A  ENSG00000142082  ENST00000529937  Transcript         intron_variant             -    -    -    -    -        -  HGVSc=ENST00000529937.1:c.136-346G>A
22_16084370_G/A  22:16084370 A  -                ENSR00000615113  RegulatoryFeature  regulatory_region_variant  -    -    -    -    -        -  -
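Lines in this format can be read with a short script. The following Python sketch is one possible approach rather than an official parser; the function name parse_output_line is hypothetical. It assumes the default fourteen columns (that is, --fields has not been used) and that no field contains whitespace. The column names are taken from the VCF and JSON sections of this document.

```python
from typing import Dict, Union

COLUMNS = [
    "Uploaded_variation", "Location", "Allele", "Gene", "Feature",
    "Feature_type", "Consequence", "cDNA_position", "CDS_position",
    "Protein_position", "Amino_acids", "Codons", "Existing_variation", "Extra",
]


def parse_output_line(line: str) -> Dict[str, Union[str, Dict[str, str]]]:
    """Turn one data line of default VEP output into a dict keyed by column name."""
    values = line.split()
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    row: Dict[str, Union[str, Dict[str, str]]] = dict(zip(COLUMNS, values))
    extra = values[-1]
    # The Extra column holds key=value pairs separated by ';', or '-' when empty.
    if extra == "-":
        row["Extra"] = {}
    else:
        row["Extra"] = dict(pair.split("=", 1) for pair in extra.split(";"))
    return row
```

Header lines, which begin with "#", should be skipped before calling this function. For the first example line above, the resulting Extra entry holds the keys SIFT and PolyPhen.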

The VEP script will also add a header to the output file. This contains information about the databases connected to, and also a key describing the key/value pairs used in the extra column.

## ENSEMBL VARIANT EFFECT PREDICTOR v76
## Output produced at 2014-08-01 16:09:38
## Connected to homo_sapiens_core_76_38 on ensembldb.ensembl.org
## Using API version 76, DB version 76
## Extra column keys:
## DISTANCE : Shortest distance from variant to transcript

VCF output

The VEP script can also generate VCF output using the --vcf flag. Consequences are added in the INFO field of the VCF file, using the key "CSQ". Data fields are separated by "|"; the order of fields is written in the VCF header. Output fields can be configured by using --fields. Unpopulated fields are represented by an empty string.

VCFs produced by the VEP can be filtered by filter_vep.pl in the same way as standard format output files.

If the input format was VCF, the file will remain unchanged save for the addition of the CSQ field and the header (unless using any filtering). If an existing CSQ field is found, it will be replaced by the one added by the VEP.

Custom data added with --custom are added as separate fields, using the key specified for each data file.

Commas in fields are replaced with ampersands (&) to preserve VCF format.

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|SYMBOL|SIFT">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
21      26960070        rs116645811     G       A       .       .       CSQ=A|ENSG00000260583|ENST00000567517|Transcript|upstream_gene_variant|||||||4432|LINC00515|,A|ENSG00000154719|ENST00000352957|Transcript|intron_variant||||||||MRPL39|,A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|||MRPL39|tolerated(0.06)
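Because the field order is recorded in the header, a CSQ value can be decoded without hard-coding the field names. The Python sketch below shows one way to do this; the function names csq_field_names and parse_csq are hypothetical, and the sketch assumes the header description contains a "Format: ..." list as in the example above.

```python
import re
from typing import Dict, List


def csq_field_names(info_header: str) -> List[str]:
    """Extract the field order from the CSQ ##INFO header line."""
    match = re.search(r'Format: ([^"]+)"', info_header)
    if match is None:
        raise ValueError("no 'Format:' list found in the CSQ header")
    return match.group(1).split("|")


def parse_csq(info: str, field_names: List[str]) -> List[Dict[str, str]]:
    """Return one dict per consequence found in a VCF INFO column."""
    for item in info.split(";"):
        if item.startswith("CSQ="):
            # Consequences are comma-separated; fields within each are '|'-separated.
            entries = item[len("CSQ="):].split(",")
            return [dict(zip(field_names, entry.split("|"))) for entry in entries]
    return []
```

Each returned dict maps a field name from the header to its value, with unpopulated fields appearing as empty strings. Ampersands inside a value stand for commas in the original data, as noted above.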

JSON output

The VEP can produce output in the form of serialised JSON objects using the --json flag. JSON is a serialisation format that can be parsed and processed easily by many packages and programming languages; it is used as the default output format for Ensembl's REST server.

Each input variant is reported as a single JSON object which constitutes one line of the output file. The JSON object is structured somewhat differently to the other VEP output formats, in that per-variant fields (e.g. co-located existing variant details) are reported only once. Consequences are grouped under the feature type that they affect (Transcript, Regulatory Feature, etc). The original input line (e.g. from VCF input) is reported under the "input" key in order to aid aligning input with output.

Here follows an example of JSON output (prettified and redacted for display here):

{
  "input": "1 230845794 test1 A G . . .",
  "id": "test1",
  "seq_region_name": "1",
  "start": 230845794,
  "end": 230845794,
  "strand": 1,
  "allele_string": "A/G",
  "most_severe_consequence": "missense_variant",
  "colocated_variants": [
    {
      "id": "rs699",
      "seq_region_name": "1",
      "start": 230845794,
      "end": 230845794,
      "strand": 1,
      "allele_string": "A/G",
      "minor_allele": "A",
      "minor_allele_freq": 0.3384,
      "afr_allele": "A",
      "afr_maf": 0.13,
      "amr_allele": "A",
      "amr_maf": 0.36,
      "asn_allele": "A",
      "asn_maf": 0.16,
      "eur_allele": "A",
      "eur_maf": 0.41,
      "pubmed": [
        18513389,
        23716723
      ]
    },
    {
      "seq_region_name": "1",
      "strand": 1,
      "id": "COSM425562",
      "allele_string": "A/G",
      "start": 230845794,
      "end": 230845794
    }
  ],
  "transcript_consequences": [
    {
      "variant_allele": "G",
      "consequence_terms": [
        "missense_variant"
      ],
      "gene_id": "ENSG00000135744",
      "gene_symbol": "AGT",
      "gene_symbol_source": "HGNC",
      "transcript_id": "ENST00000366667",
      "biotype": "protein_coding",
      "strand": -1,
      "cdna_start": 1018,
      "cdna_end": 1018,
      "cds_start": 803,
      "cds_end": 803,
      "protein_start": 268,
      "protein_end": 268,
      "codons": "aTg/aCg",
      "amino_acids": "M/T",
      "polyphen_prediction": "benign",
      "polyphen_score": 0,
      "sift_prediction": "tolerated",
      "sift_score": 1,
      "hgvsc": "ENST00000366667.4:c.803T>C",
      "hgvsp": "ENSP00000355627.4:p.Met268Thr"
    }
  ],
  "regulatory_feature_consequences": [
    {
      "variant_allele": "G",
      "consequence_terms": [
        "regulatory_region_variant"
      ],
      "regulatory_feature_id": "ENSR00001529861"
    }
  ]
}

In accordance with JSON conventions, all keys are lower-case. Some keys also have different names and structures to those found in the other VEP output formats:

Key                 JSON equivalent(s)                                      Notes
Consequence         consequence_terms
Gene                gene_id
Feature             transcript_id, regulatory_feature_id, motif_feature_id  Consequences are grouped under the feature type they affect
ALLELE              variant_allele
SYMBOL              gene_symbol
SYMBOL_SOURCE       gene_symbol_source
ENSP                protein_id
OverlapBP           bp_overlap
OverlapPC           percentage_overlap
Uploaded_variation  id
Location            seq_region_name, start, end, strand                     The variant's location field is broken down into constituent coordinate parts for clarity. "seq_region_name" is used in place of "chr" or "chromosome" for consistency with other parts of Ensembl's REST API
GMAF                minor_allele, minor_allele_freq
*_maf               *_allele, *_maf
cDNA_position       cdna_start, cdna_end
CDS_position        cds_start, cds_end
Protein_position    protein_start, protein_end
SIFT                sift_prediction, sift_score
PolyPhen            polyphen_prediction, polyphen_score
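Since each input variant occupies one line of the output file, the file can be read line by line with any JSON library. The following Python sketch uses the standard json module and the key names from the example above; the function names read_vep_json and transcript_terms are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def read_vep_json(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one decoded JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def transcript_terms(record: Dict) -> List[Tuple[str, str]]:
    """List (transcript_id, consequence term) pairs for one variant."""
    pairs = []
    # The key is absent when no transcript is affected, hence the default.
    for consequence in record.get("transcript_consequences", []):
        for term in consequence["consequence_terms"]:
            pairs.append((consequence["transcript_id"], term))
    return pairs
```

A file handle can be passed directly as the lines argument, for example read_vep_json(open(path)) where path names a file written with --json.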

Statistics

The VEP writes an HTML file containing statistics pertaining to the results of your job; it is named [output_file]_summary.html (with the default options the file will be named variant_effect_output.txt_summary.html). To view it you should open the file in your web browser.

To prevent the VEP writing a stats file, use the flag --no_stats. To have the VEP write a machine-readable text file in place of the HTML, use --stats_text. To change the name of the stats file from the default, use --stats_file [file].

The page contains several sections:

General statistics

This section contains two tables. The first describes the cache and/or database used, the version of the VEP, species, command line parameters, input/output files and run time. The second table contains information about the number of variants, and the number of genes, transcripts and regulatory features overlapped by the input.

Charts and tables

There then follow several charts, most with accompanying tables. Tables and charts are interactive; clicking on a row to highlight it in the table will highlight the relevant segment in the chart, and vice versa.