Variation File Format - Definition and supported options
The Variant Effect Predictor tool, which appears as an option when you click on Manage your Data, allows you to upload a set of variation data and predict the effect of the variants.
Note that the input and output formats are completely different.
Data must be supplied in a simple tab-separated format, containing five columns, all required:
- chromosome - just the name or number, with no 'chr' prefix
- start - start coordinate of the variant
- end - end coordinate of the variant
- allele - pair of alleles separated by a '/', with the reference allele first
- strand - defined as + (forward) or - (reverse).
1 881907 881906 -/C +
5 140532 140532 T/C +
12 1017956 1017956 T/A +
2 946507 946507 G/C +
14 19584687 19584687 C/T -
19 66520 66520 G/A +
8 150029 150029 A/T +
An insertion is indicated by start coordinate = end coordinate + 1. For example, an insertion of 'C' between nucleotides 12600 and 12601 on the forward strand of chromosome 8 is indicated as follows:
8 12601 12600 -/C +
A deletion is indicated by the exact nucleotide coordinates. For example, a three base pair deletion of nucleotides 12600, 12601, and 12602 of the reverse strand of chromosome 8 will be:
8 12600 12602 CGT/- -
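The column rules and the insertion/deletion conventions above can be sketched in a few lines of Python. This is a minimal illustration, not the parser VEP itself uses: the names `Variant`, `parse_variant_line` and `classify` are hypothetical, and the checks only cover what this page states (five columns, no 'chr' prefix, reference allele first, strand given as + or -).

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """One line of the default five-column input format (hypothetical type)."""
    chromosome: str
    start: int
    end: int
    ref: str
    alt: str
    strand: str


def parse_variant_line(line: str) -> Variant:
    """Split one input line into its five columns and validate them."""
    # The format is tab-separated; split() also tolerates runs of spaces.
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    chromosome, start, end, alleles, strand = fields
    if chromosome.lower().startswith("chr"):
        raise ValueError("chromosome must be given without a 'chr' prefix")
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    # Alleles are a pair separated by '/', reference allele first.
    ref, sep, alt = alleles.partition("/")
    if not sep or not ref or not alt:
        raise ValueError(f"alleles must look like REF/ALT, got {alleles!r}")
    return Variant(chromosome, int(start), int(end), ref, alt, strand)


def classify(variant: Variant) -> str:
    """Label a variant using the conventions described above."""
    # Insertion: reference allele is '-' and start = end + 1.
    if variant.ref == "-" and variant.start == variant.end + 1:
        return "insertion"
    # Deletion: variant allele is '-' and the coordinates span the deleted bases.
    if variant.alt == "-":
        return "deletion"
    return "substitution"


if __name__ == "__main__":
    for example in ("8 12601 12600 -/C +", "8 12600 12602 CGT/- -"):
        parsed = parse_variant_line(example)
        print(parsed, classify(parsed))
```

Running the script prints the two parsed example lines together with their labels, which is a quick way to check a file before uploading it.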
The following input file formats are also supported:
- Variant Call Format (VCF) - see http://www.1000genomes.org/wiki/Analysis/vcf4.0 for details.
- Pileup format
- HGVS notations - see http://www.hgvs.org/mutnomen/ for details. These must be relative to genomic or Ensembl transcript coordinates. It is possible, although less reliable, to use notations relative to RefSeq transcripts in the VEP script.
- Variant identifiers - these should be e.g. dbSNP rsIDs, or any synonym for a variant present in the Ensembl Variation database. The Ensembl documentation lists the identifier sources available in Ensembl.
When using the web VEP, ensure that you have the correct file format selected from the drop-down menu. The VEP script is able to auto-detect the format of the input file.
The tool predicts the consequence of each variation, the amino acid position and change (if the variation falls within a protein) and the identifier of any known variations that occur at this position. The output columns are:
- Uploaded variation - as chromosome_start_alleles
- Location - in standard coordinate format (chr:start or chr:start-end)
- Allele - the variant allele used to calculate the consequence
- Gene - Ensembl stable ID of affected gene
- Feature - Ensembl stable ID of feature
- Feature type - type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature.
- Consequence - consequence type of this variation
- Relative position in cDNA - base pair position in cDNA sequence
- Relative position in CDS - base pair position in coding sequence
- Relative position in protein - amino acid position in protein
- Amino acid change - only given if the variation affects the protein-coding sequence
- Codons - the alternate codons with the variant base highlighted as bold (HTML) or upper case (text)
- Corresponding variation - identifier of existing variation
- Extra - this column contains extra information as key=value pairs separated by ";". The keys are as follows:
  - HGNC - the HGNC gene identifier
  - ENSP - the Ensembl protein identifier of the affected transcript
  - HGVSc - the HGVS coding sequence name
  - HGVSp - the HGVS protein sequence name
  - SIFT - the SIFT prediction and/or score, with both given as prediction(score)
  - PolyPhen - the PolyPhen prediction and/or score
  - Condel - the Condel consensus prediction and/or score
  - MOTIF_NAME - the source and identifier of a transcription factor binding profile (TFBP) aligned at this position
  - MOTIF_POS - the relative position of the variation in the aligned TFBP
  - HIGH_INF_POS - a flag indicating if the variant falls in a high information position of the TFBP
  - MOTIF_SCORE_CHANGE - the difference in motif score of the reference and variant sequences for the TFBP
  - CANONICAL - a flag indicating if the transcript is denoted as the canonical transcript for this gene
  - CCDS - the CCDS identifier for this transcript, where applicable
  - INTRON - the intron number (out of total number)
  - EXON - the exon number (out of total number)
  - DOMAINS - the source and identifier of any overlapping protein domains
Empty values are denoted by '-'. Further fields in the Extra column can be added by plugins or using custom annotations in the VEP script. Output fields can be configured using the --fields flag when running the VEP script.
11_224088_C/A 11:224088 A ENSG00000142082 ENST00000525319 Transcript NON_SYNONYMOUS_CODING 742 716 239 T/N aCc/aAc - SIFT=deleterious(0);PolyPhen=unknown(0)
11_224088_C/A 11:224088 A ENSG00000142082 ENST00000534381 Transcript 5_PRIME_UTR - - - - - - -
11_224088_C/A 11:224088 A ENSG00000142082 ENST00000529055 Transcript DOWNSTREAM - - - - - - -
11_224585_G/A 11:224585 A ENSG00000142082 ENST00000529937 Transcript INTRONIC,NMD_TRANSCRIPT - - - - - - HGVSc=ENST00000529937.1:c.136-346G>A
22_16084370_G/A 22:16084370 A - ENSR00000615113 RegulatoryFeature REGULATORY_REGION - - - - - - -
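As an illustration of how the output columns fit together, the following Python sketch turns one output line into a dictionary. It rests on assumptions: the output is taken to be tab-delimited with the 14 columns in the order listed above, the dictionary keys in `OUTPUT_COLUMNS` are invented for this example rather than taken from VEP, and the Allele column is left as-is because a '-' there may denote a deleted allele rather than an empty value.

```python
import re

# Hypothetical key names, one per output column in the order described above.
OUTPUT_COLUMNS = [
    "uploaded_variation", "location", "allele", "gene", "feature",
    "feature_type", "consequence", "cdna_position", "cds_position",
    "protein_position", "amino_acid_change", "codons",
    "corresponding_variation", "extra",
]

_PREDICTION = re.compile(r"^(?P<pred>[^()]+)\((?P<score>[^()]+)\)$")


def parse_location(location):
    """Split 'chr:start' or 'chr:start-end' into (chromosome, start, end)."""
    chromosome, _, span = location.partition(":")
    start, _, end = span.partition("-")
    return chromosome, int(start), int(end) if end else int(start)


def parse_extra(extra):
    """Turn 'KEY=value;KEY=value' into a dict; '-' means no extra fields."""
    if extra == "-":
        return {}
    pairs = (item.partition("=") for item in extra.split(";") if item)
    return {key: value for key, _, value in pairs}


def split_prediction(value):
    """Split a SIFT/PolyPhen/Condel value into (prediction, score)."""
    match = _PREDICTION.match(value)
    if match:  # both given, as prediction(score)
        return match["pred"], float(match["score"])
    try:  # score only
        return None, float(value)
    except ValueError:  # prediction only
        return value, None


def parse_output_line(line):
    """Parse one tab-delimited output line into a dict keyed by column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(OUTPUT_COLUMNS):
        raise ValueError(f"expected {len(OUTPUT_COLUMNS)} columns, got {len(fields)}")
    record = {}
    for name, value in zip(OUTPUT_COLUMNS, fields):
        # Empty values are denoted by '-'; the allele column is kept verbatim.
        record[name] = None if value == "-" and name != "allele" else value
    # Several consequence types may be listed, separated by commas.
    record["consequence"] = fields[6].split(",")
    record["extra"] = parse_extra(fields[13])
    return record
```

Called on the first example line above, `parse_output_line` would give an `extra` dictionary with the keys `SIFT` and `PolyPhen`, and `split_prediction("deleterious(0)")` separates the prediction from its score.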
The VEP script will also add a header to the output file. This contains information about the databases connected to, and also a key describing the key/value pairs used in the Extra column.
## ENSEMBL VARIANT EFFECT PREDICTOR v2.4
## Output produced at 2012-02-20 16:09:38
## Connected to homo_sapiens_core_66_37 on ensembldb.ensembl.org
## Using API version 66, DB version 66
## Extra column keys:
## CANONICAL : Indicates if transcript is canonical for this gene
## CCDS : Indicates if transcript is a CCDS transcript
## HGNC : HGNC gene identifier
## ENSP : Ensembl protein identifer
## HGVSc : HGVS coding sequence name
## HGVSp : HGVS protein sequence name
## SIFT : SIFT prediction
## PolyPhen : PolyPhen prediction
## Condel : Condel SIFT/PolyPhen consensus prediction
## EXON : Exon number
## INTRON : Intron number
## DOMAINS : The source and identifer of any overlapping protein domains
## MOTIF_NAME : The source and identifier of a transcription factor binding profile (TFBP) aligned at this position
## MOTIF_POS : The relative position of the variation in the aligned TFBP
## HIGH_INF_POS : A flag indicating if the variant falls in a high information position of the TFBP
## MOTIF_SCORE_CHANGE : The difference in motif score of the reference and variant sequences for the TFBP
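Because the header carries its own key for the Extra column, a script can read the key descriptions from the file instead of hard-coding them. The sketch below assumes that each header line starts with '##', that the key list follows a line reading "Extra column keys:", and that each key is separated from its description by ' : ', as in the example above. The function name `read_extra_key_descriptions` is hypothetical.

```python
def read_extra_key_descriptions(lines):
    """Collect {key: description} from the '##' header of an output file."""
    descriptions = {}
    in_key_section = False
    for line in lines:
        if not line.startswith("##"):
            break  # the first non-header line ends the header
        text = line[2:].strip()
        if text.lower().startswith("extra column keys"):
            in_key_section = True
            continue
        if in_key_section and " : " in text:
            key, _, description = text.partition(" : ")
            descriptions[key.strip()] = description.strip()
    return descriptions


if __name__ == "__main__":
    header = [
        "## ENSEMBL VARIANT EFFECT PREDICTOR v2.4",
        "## Extra column keys:",
        "## SIFT : SIFT prediction",
        "## EXON : Exon number",
    ]
    for key, description in read_extra_key_descriptions(header).items():
        print(f"{key}: {description}")
```

Keys added by plugins or custom annotations would be picked up the same way, provided they are described in the header in the same form.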