Ensembl Variation - Data description
Below is a description of the data we store in the databases for Ensembl Variation.
For several different species in Ensembl, we import variation data (SNPs, CNVs, allele frequencies, genotypes, etc.) from a variety of sources (e.g. dbSNP). Imported variants and alleles are subjected to a quality control process to flag suspect data.
We classify the variants into different classes and calculate the predicted consequence(s) of each variant. We have also created variation sets to help people retrieve a specific group of variants from a particular dataset.
In human, we calculate the linkage disequilibrium for each variant, by population.
See some examples of imported data on the Ensembl website (Human):
Variation species and data sources
Ensembl stores variation data for the following species, but note that users can still use the Variant Effect Predictor on species for which we do not currently have a variation database.
There are currently 22 variation databases in Ensembl:
The majority of variants are imported from NCBI dbSNP. The data is imported when it is released by dbSNP and incorporated into the next Ensembl release. If dbSNP releases the data on a different assembly, Ensembl will remap the variant positions onto the current assembly. Data from projects like the HapMap Project and 1000 Genomes Project is imported once it has been submitted to dbSNP.
Ensembl also includes data from other sources. To view data from these sources in the browser go to a species Location page (e.g. for human), and click on the 'Configure this page' link on the left-hand side. The 'Variation' and 'Somatic mutations' sections contain a track list of all sources of variation data for that species.
Variation data can be viewed in the browser through pages such as:
- Gene: Variation Table and Variation Image (for all variations in a gene) e.g. for all variants in KCNE2. Structural Variation to see all structural variants overlapping the gene.
- Transcript: Population comparison, Comparison image (for comparing variants in a transcript across different individual or strain sequences) e.g. compare Tmco4 in different mouse strains
- Transcript: Sequence, protein: list of the coding variants in protein coordinates.
- Location: Region in Detail (Variations can be drawn using "Configure this page" at the left. The menu allows display of information in Ensembl databases along with external sources in DAS format such as DGV loci.)
- Phenotype: A karyotype view to display the variants associated with a certain phenotype, e.g. phenotype "Glaucoma".
Clicking on any variation on an Ensembl page will open a Variation tab with information about the flanking sequence and source for the selected variation. Links to linkage disequilibrium (LD) plots, phenotype information (for human) from EGA, OMIM and NHGRI and Ensembl genes and transcripts that include the variation can be found at the left of this tab. You may also view multiple genome alignments of various species, highlighting the variation. Ancestral sequences are included in this display.
The Ensembl Variation database stores data imported from external sources and also data calculated on site.
- Data imported from external sources (dbSNP, Sanger, DGVa, ...):
- Variations (SNPs, in-dels, insertion, deletion, ...)
- Structural variations (copy number variation, tandem duplication, inversion, ...)
- Probes for copy number variations
- Locations for variations and structural variations
- Phenotypes (e.g. list of phenotypes in human)
- Citations (extracted from dbSNP submissions and text mining performed by EPMC and UCSC)
- Calculated data: see the Predicted data page.
We call the class of a variation according to its component alleles and its mapping to the reference genome, and then display this information on the website. Internally we use Sequence Ontology terms, but we map these to our own 'display' terms where common usage differs from the SO definition (e.g. our term SNP is closer to the SO term SNV). All the classes we call, along with their equivalent SO term are shown in the table below. We also differentiate somatic mutations from germline variations in the display term, prefixing the term with 'somatic'. API users can fetch either the SO term or the display term.
|*||SO term||SO description||SO accession||Ensembl term||Called for|
|SNV||SNVs are single nucleotide positions in genomic DNA at which different sequence alternatives exist.||SO:0001483||SNP||Variation|
|indel||A sequence alteration which included an insertion and a deletion, affecting 2 or more bases.||SO:1000032||indel||Variation|
|substitution||A sequence alteration where the length of the change in the variant is the same as that of the reference.||SO:1000002||substitution||Variation|
|tandem_repeat||Two or more adjacent copies of a region (of length greater than 1).||SO:0000705||tandem_repeat||Variation|
|complex_structural_alteration||A structural sequence alteration or rearrangement encompassing one or more genome fragments.||SO:0001784||Complex||Structural variation|
|copy_number_gain||A sequence alteration whereby the copy number of a given region is greater than the reference sequence.||SO:0001742||Gain||Structural variation|
|copy_number_loss||A sequence alteration whereby the copy number of a given region is less than the reference sequence.||SO:0001743||Loss||Structural variation|
|copy_number_variation||A variation that increases or decreases the copy number of a given region.||SO:0001019||CNV||Structural variation|
|duplication||One or more nucleotides are added between two adjacent nucleotides in the sequence; the inserted sequence derives from, or is identical in sequence to, nucleotides adjacent to insertion point.||SO:1000035||Duplication||Structural variation|
|interchromosomal_breakpoint||A rearrangement breakpoint between two different chromosomes.||SO:0001873||Interchromosomal breakpoint||Structural variation|
|intrachromosomal_breakpoint||A rearrangement breakpoint within the same chromosome.||SO:0001874||Intrachromosomal breakpoint||Structural variation|
|inversion||A continuous nucleotide sequence is inverted in the same position.||SO:1000036||inversion||Structural variation|
|mobile_element_insertion||A kind of insertion where the inserted sequence is a mobile element.||SO:0001837||Mobile element insertion||Structural variation|
|somatic_Mobile element insertion|
|novel_sequence_insertion||An insertion the sequence of which cannot be mapped to the reference genome.||SO:0001838||Novel sequence insertion||Structural variation|
|somatic_Novel sequence insertion|
|tandem_duplication||A duplication consisting of 2 identical adjacent regions.||SO:1000173||Tandem duplication||Structural variation|
|translocation||A region of nucleotide sequence that has translocated to a new position.||SO:0000199||translocation||Structural variation|
|deletion||The point at which one or more contiguous nucleotides were excised.||SO:0000159||deletion||Variation|
|insertion||The sequence of one or more nucleotides added between two adjacent nucleotides in the sequence.||SO:0000667||insertion||Variation|
|sequence_alteration||A sequence_alteration is a sequence_feature whose extent is the deviation from another sequence.||SO:0001059||sequence_alteration||Variation|
|probe||A DNA sequence used experimentally to detect the presence or absence of a complementary nucleic acid.||SO:0000051||CNV_PROBE||CNV probe|
* Corresponding colours for the Ensembl web displays (only for Structural variations). The colours are based on the dbVar displays.
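The mapping from SO terms to display terms can be sketched as a simple lookup. This is a minimal illustration, not the Ensembl API: the dictionary is copied from the table above, the function name `display_term` is hypothetical, and the `somatic_` prefix format is assumed from the 'somatic_Mobile element insertion' row (the real database may use a different form for some terms).

```python
# SO term -> Ensembl display term, copied from the table above.
SO_TO_DISPLAY = {
    "SNV": "SNP",
    "indel": "indel",
    "substitution": "substitution",
    "tandem_repeat": "tandem_repeat",
    "complex_structural_alteration": "Complex",
    "copy_number_gain": "Gain",
    "copy_number_loss": "Loss",
    "copy_number_variation": "CNV",
    "duplication": "Duplication",
    "interchromosomal_breakpoint": "Interchromosomal breakpoint",
    "intrachromosomal_breakpoint": "Intrachromosomal breakpoint",
    "inversion": "inversion",
    "mobile_element_insertion": "Mobile element insertion",
    "novel_sequence_insertion": "Novel sequence insertion",
    "tandem_duplication": "Tandem duplication",
    "translocation": "translocation",
    "deletion": "deletion",
    "insertion": "insertion",
    "sequence_alteration": "sequence_alteration",
    "probe": "CNV_PROBE",
}


def display_term(so_term: str, somatic: bool = False) -> str:
    """Return the display term for an SO term (hypothetical helper).

    Assumption: somatic terms are formed by prefixing 'somatic_'.
    """
    term = SO_TO_DISPLAY[so_term]
    return f"somatic_{term}" if somatic else term


print(display_term("SNV"))
print(display_term("mobile_element_insertion", somatic=True))
```

Keeping the SO term as the key means the display vocabulary can change without touching the stored classification.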
Insertion and Deletion coordinates
In Ensembl, an insertion is indicated by start coordinate = end coordinate + 1. For example, an insertion of 'C' between nucleotides 12600 and 12601 on the forward strand has start coordinate 12601 and end coordinate 12600.
A deletion is indicated by the exact nucleotide coordinates. For example, a three base pair deletion of nucleotides 12600, 12601, and 12602 of the reverse strand has start coordinate 12600 and end coordinate 12602.
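The coordinate convention for insertions and deletions can be sketched with a few small helpers. The function names are hypothetical and this is not part of the Ensembl API. The sketch assumes that start and end are reported in ascending chromosome order regardless of strand.

```python
from typing import Tuple


def insertion_coords(left_base: int) -> Tuple[int, int]:
    """Coordinates for an insertion between left_base and left_base + 1.

    By the convention described above, start = end + 1.
    """
    end = left_base
    start = end + 1
    return start, end


def deletion_coords(first_deleted: int, length: int) -> Tuple[int, int]:
    """Coordinates for a deletion of `length` bases starting at first_deleted."""
    start = first_deleted
    end = first_deleted + length - 1
    return start, end


def is_insertion(start: int, end: int) -> bool:
    """An insertion is the only case where start is one greater than end."""
    return start == end + 1


print(insertion_coords(12600))    # insertion between 12600 and 12601
print(deletion_coords(12600, 3))  # deletion of 12600, 12601 and 12602
```

Because an insertion occupies no reference bases, the inverted pair (start greater than end) is what distinguishes it from a one-base feature.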
For each variant we store population data (allele frequencies, genotypes) available from different projects (e.g. 1000 Genomes, HapMap). You can see this on our variation pages.
Below we list the population names associated with the 1000 Genomes Project and the HapMap Project:
Populations from the 1000 Genomes Project (Human)
|1000GENOMES:phase_1_ALL||1092||All individuals from phase 1 of the 1000 Genomes Project|
|1000GENOMES:phase_1_AFR||246||All African individuals from phase 1 of the 1000 Genomes Project (YRI, LWK, ASW)|
|1000GENOMES:phase_1_AMR||181||All American individuals from phase 1 of the 1000 Genomes Project|
|1000GENOMES:phase_1_ASN||286||All East Asian individuals from phase 1 of the 1000 Genomes Project (CHB, JPT, CHS)|
|1000GENOMES:phase_1_ASW||61||Americans of African Ancestry in SW USA|
|1000GENOMES:phase_1_CEU||85||Utah Residents (CEPH) with Northern and Western European ancestry|
|1000GENOMES:phase_1_CHB||97||Han Chinese in Beijing, China|
|1000GENOMES:phase_1_CHS||100||Southern Han Chinese|
|1000GENOMES:phase_1_CLM||60||Colombian from Medellin, Colombia|
|1000GENOMES:phase_1_EUR||379||All European individuals from phase 1 of the 1000 Genomes Project (CEU, TSI, FIN, GBR, IBS)|
|1000GENOMES:phase_1_FIN||93||Finnish in Finland|
|1000GENOMES:phase_1_GBR||89||British in England and Scotland|
|1000GENOMES:phase_1_IBS||14||Iberian population in Spain|
|1000GENOMES:phase_1_JPT||89||Japanese in Tokyo, Japan|
|1000GENOMES:phase_1_LWK||97||Luhya in Webuye, Kenya|
|1000GENOMES:phase_1_MXL||66||Mexican Ancestry from Los Angeles USA|
|1000GENOMES:phase_1_PUR||55||Puerto Ricans from Puerto Rico|
|1000GENOMES:phase_1_TSI||98||Toscani in Italy|
|1000GENOMES:phase_1_YRI||88||Yoruba in Ibadan, Nigeria|
Populations from the HapMap Project (Human)
|CSHL-HAPMAP:HAPMAP-ASW||90||African ancestry in Southwest USA. ASW is one of the 11 populations in HapMap phase 3.|
|CSHL-HAPMAP:HapMap-CEU||180||Utah residents with Northern and Western European ancestry from the CEPH collection. CEU is one of the 11 populations in HapMap phase 3.|
|CSHL-HAPMAP:HAPMAP-CHB||90||Han Chinese in Beijing, China. CHB is one of the 11 populations in HapMap phase 3.|
|CSHL-HAPMAP:HAPMAP-CHD||100||Chinese in Metropolitan Denver, Colorado. CHD is one of the 11 populations in HapMap phase 3.|
|CSHL-HAPMAP:HAPMAP-GIH||100||Gujarati Indians in Houston, Texas. GIH is one of the 11 populations in HapMap phase 3.|
|CSHL-HAPMAP:HapMap-HCB||45||45 unrelated Han Chinese in Beijing, China, representing one of the populations studied in the International HapMap project (http://www.hapmap.org). See http://www.hapmap.org/citinghapmap.html.en for further information about this population and others studied in the project. http://www.hapmap.org/hapmappopulations.html.en also has relevant information.|
|CSHL-HAPMAP:HapMap-JPT||91||Japanese in Tokyo, Japan. JPT is one of the 11 populations in HapMap phase 3.|
|CSHL-HAPMAP:HAPMAP-LWK||100||Luhya in Webuye, Kenya. LWK is one of the 11 populations in HapMap phase 3.|
|CSHL-HAPMAP:HAPMAP-MEX||90||Mexican ancestry in Los Angeles, California. MEX is one of the 11 populations in HapMap phase 3.|
|CSHL-HAPMAP:HAPMAP-MKK||180||Maasai in Kinyawa, Kenya. MKK is one of the 11 populations in HapMap phase 3.|
|CSHL-HAPMAP:HAPMAP-TSI||100||Toscani in Italy. TSI is one of the 11 populations in HapMap phase 3.|
|CSHL-HAPMAP:HapMap-YRI||180||Yoruba in Ibadan, Nigeria. YRI is one of the 11 populations in HapMap phase 3.|
This population data is stored in the population table of the Variation database. You can see a description of the population table here.
We use the concept of variation sets to group variations that share some property together. For example, we have grouped the variations identified in the three different 1000 Genomes pilot studies into separate variation sets. The sets can be further subdivided into supersets and subsets to reflect hierarchical relationships between them. In the case of the 1000 Genomes phase 1 sets, these are divided into subsets based on population. For example, the set representing variations identified in the 1000 Genomes phase 1 study is named '1000 Genomes - All' and has several subsets like: '1000 Genomes - AFR', '1000 Genomes - AMR', '1000 Genomes - ASN' and '1000 Genomes - EUR'. The variation sets can be displayed as separate tracks on the location view. This behaviour is controlled from the 'Variation' section on the configuration panel which is accessed by clicking the 'Configure this page' link in the left hand side navigation.
The sets are constructed during production and are stored in the database. The table below lists the available variation sets in the Ensembl variation database (subsets are indicated by bullet points).
Variation sets common to all species
|All failed variations||fail_all||Variations that have failed the Ensembl QC checks|
Variation sets specific to Human
|1000 Genomes - All||1kg||Variants genotyped by the 1000 Genomes project (phase 1)|
||1kg_afr||Variants genotyped in African individuals by the 1000 Genomes project (phase 1)|
||1kg_afr_com||Variants genotyped in African individuals by the 1000 Genomes project (phase 1) with frequency of at least 1%|
||1kg_amr||Variants genotyped in admixed American individuals by the 1000 Genomes project (phase 1)|
||1kg_amr_com||Variants genotyped in admixed American individuals by the 1000 Genomes project (phase 1) with frequency of at least 1%|
||1kg_asn||Variants genotyped in East Asian individuals by the 1000 Genomes project (phase 1)|
||1kg_asn_com||Variants genotyped in East Asian individuals by the 1000 Genomes project (phase 1) with frequency of at least 1%|
||1kg_com||Variants genotyped by the 1000 Genomes project (phase 1) with frequency of at least 1%|
||1kg_eur||Variants genotyped in European individuals by the 1000 Genomes project (phase 1)|
||1kg_eur_com||Variants genotyped in European individuals by the 1000 Genomes project (phase 1) with frequency of at least 1%|
|1000 Genomes - High coverage - Trios||1kg_hct||Variations called by the 1000 Genomes project on high coverage sequence data from two family trios (Pilot 2)|
|1000 Genomes - High quality||1kg_hq||Structural variants labelled as "High quality site" by the 1000 Genomes project (phase 1)|
|1000 Genomes - Low coverage||1kg_lc||Variations called by the 1000 Genomes project on low coverage sequence data from 179 unrelated individuals (Pilot 1)|
|Affy GeneChip 500K||Affy_500K||Variants from the Affymetrix GeneChip Human Mapping 500K Array Set|
|Affy GenomeWideSNP_6.0||Affy_SNP6||Variants from the Affymetrix Genome-Wide Human SNP Array 6.0|
|Illumina_Cardio-Metabo_Chip||Cardio-Metabo_Chip||Variants from the Illumina Cardio-Metabo_Chip genotyping array designed to target variants of interest for metabolic and cardiovascular disease traits|
|Illumina_Human610_Quad||Human610_Quad||Variants from the Illumina Human610_Quad v1_B whole genome genotyping array designed for association studies|
|Illumina_HumanHap550||HumanHap550||Variants from the Illumina Human550 v3.0 whole genome genotyping array designed for association studies|
|Illumina_HumanHap650Y||HumanHap650Y||Variants from the Illumina HumanHap650Y v3.0 whole genome genotyping array designed for association studies|
|Illumina_HumanOmni1-Quad||HumanOmni1-Quad||Variants from the Illumina HumanOmni1-Quad whole genome genotyping array designed for association studies|
|Illumina_HumanOmni2.5||HumanOmni2.5||Variants from the Illumina HumanOmni2.5 4v1 whole genome genotyping array designed for association studies|
|Illumina_HumanOmni5||HumanOmni5||Variants from the Illumina HumanOmni5v1 whole genome genotyping array designed for association studies|
|Illumina_1M-duo||Illumina_1M-duo||Variants from the Illumina Human1M-duo v3 whole genome genotyping array designed for association studies|
|Illumina_Human660W-quad||Illumina_660Q||Variants from the Illumina Human660W-quad whole genome genotyping array designed for association studies|
|Illumina_CytoSNP12v1||Illumina_CytoSNP12v1||Variants from the Illumina Cyto SNP-12 v1 whole genome SNP genotyping chip designed for cytogenetic analysis|
|ESP_6500||esp_6500||Variants from the NHLBI Exome Sequencing Project|
|All HapMap||hapmap||Variations which have been assayed by The International HapMap Project [http://hapmap.ncbi.nlm.nih.gov/]|
||hapmap_ceu||Variations which have been assayed by The International HapMap Project from CEU individuals|
||hapmap_hcb||Variations which have been assayed by The International HapMap Project from HCB individuals|
||hapmap_jpt||Variations which have been assayed by The International HapMap Project from JPT individuals|
||hapmap_yri||Variations which have been assayed by The International HapMap Project from YRI individuals|
|Anonymous Korean||ind_ak1||Variants genotyped in an anonymous Korean individual|
|Misha Angrist||ind_angrist||Variants genotyped in Misha Angrist|
|Henry Louis Gates Jr||ind_gates_jr||Variants genotyped in Henry Louis Gates Jr|
|Henry Louis Gates Sr||ind_gates_sr||Variants genotyped in Henry Louis Gates Sr|
|Rosalynn Gill||ind_gill||Variants genotyped in Rosalynn Gill|
|Anonymous Irish Male||ind_irish||Variants genotyped in an anonymous Irish Male|
|Marjolein Kriek||ind_kriek||Variants genotyped in Marjolein Kriek|
|Stephen Quake||ind_quake||Variants genotyped in Stephen Quake|
|Saqqaq||ind_saqqaq||Variants genotyped in a Palaeo-Eskimo Saqqaq individual|
|Saqqaq HC||ind_saqqaq_hc||Variants genotyped in a Palaeo-Eskimo Saqqaq individual (high confidence SNPs)|
|Seong-Jin Kim||ind_sjk||Variants genotyped in Seong-Jin Kim|
|ENSEMBL:Venter||ind_venter||Variants genotyped in Craig Venter|
|ENSEMBL:Watson||ind_watson||Variants genotyped in James Watson|
|YanHang||ind_yh||Variants genotyped in a Han Chinese individual (YanHuang Project)|
|All phenotype-associated variants||ph_variants||Variants that have been associated with a phenotype|
||ph_cosmic||Phenotype annotations of somatic mutations found in human cancers from the COSMIC project|
||ph_hgmd_pub||Variants annotated by HGMD|
||ph_nhgri||Variants associated with phenotype data from the NHGRI GWAS catalog [http://www.genome.gov/gwastudies/]|
||ph_omim||Variations linked to entries in the Online Mendelian Inheritance in Man (OMIM) database|
||phencode||Variants from the PhenCode Project|
||ph_uniprot||Variations with phenotype annotations provided by Uniprot|
||clin_assoc||Variants described by ClinVar as being probable-pathogenic, pathogenic, drug-response or histocompatibility|
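The superset and subset relationships between variation sets can be sketched as a small tree lookup. This is an illustration only, with hypothetical helper names. The nesting is read off the table above on the assumption that a row with an empty first column is a direct subset of the nearest named row above it. Any deeper nesting in the real database is not represented.

```python
from typing import Dict, List

# Superset short name -> direct subset short names, read from the table above.
# Assumption: rows with an empty first column belong to the nearest named row.
SUBSETS: Dict[str, List[str]] = {
    "1kg": [
        "1kg_afr", "1kg_afr_com", "1kg_amr", "1kg_amr_com",
        "1kg_asn", "1kg_asn_com", "1kg_com", "1kg_eur", "1kg_eur_com",
    ],
    "hapmap": ["hapmap_ceu", "hapmap_hcb", "hapmap_jpt", "hapmap_yri"],
    "ph_variants": [
        "ph_cosmic", "ph_hgmd_pub", "ph_nhgri", "ph_omim",
        "phencode", "ph_uniprot", "clin_assoc",
    ],
}


def all_subsets(name: str) -> List[str]:
    """Return every set below `name`, depth first."""
    found: List[str] = []
    for child in SUBSETS.get(name, []):
        found.append(child)
        found.extend(all_subsets(child))
    return found


def supersets_of(name: str) -> List[str]:
    """Return the sets that list `name` as a direct subset."""
    return [parent for parent, children in SUBSETS.items() if name in children]


print(all_subsets("hapmap"))
print(supersets_of("1kg_afr"))
```

Sets without an entry in the mapping, such as `fail_all`, simply have no subsets.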
See below the list of the clinical significance terms you can find in the human Ensembl Variation database:
The clinical significance of a variant as reported by ClinVar and dbSNP.
The clinical significance of a structural variant as reported by DGVa.
We provide a simple summary of the evidence supporting a variant as a guide to its potential reliability.
|Multiple_observations||The variant has multiple independent dbSNP submissions, i.e. submissions with different submitter handles or different discovery samples|
|Frequency||The variant is reported to be polymorphic in at least one sample|
|HapMap||The variant is polymorphic in at least one HapMap panel (human only)|
|1000 Genomes||The variant was discovered in the 1000 Genomes Project (human only)|
|Cited||The variant is cited in a PubMed article.|
|ESP||The variant was discovered in the Exome Sequencing Project (human only).|
A quality control process is employed to check imported variation data. Suspect variations and alleles are flagged, but are not withheld from downstream annotation. Data failing the checks is available through the browser, where failure reasons are prominently listed. By default the API does not return failed data; to include it, the database adaptor must be specifically configured using Bio::EnsEMBL::Variation::DBSQL::DBAdaptor::include_failed_variations().
Variations for which dbSNP holds citations from PubMed are not submitted to the QC process, so they are not flagged as failed.
|QC Type||Reported failure reason||Checking process|
|Mapping checks||Variation does not map to the genome||Variations with flanking sequences which do not map to reference or non-reference genomic sequences are flagged as failed.|
|Variation maps to more than 1 location||For variations with flanking sequences mapping to a reference sequence, the number of mappings within all reference sequences is counted and those mapping more than once are flagged as failed. (Variants with a single mapping to both X and Y within a PAR region are not failed.) For variations with flanking sequences which do not map to a reference sequence, the number of mappings within all non-reference sequences is counted and those mapping more than once are flagged as failed.|
|Mapped position is not compatible with reported alleles||The length of the reported alleles is compared to that expected given the coordinates specified for the variation. If none of the alleles match the expected length, the variation is flagged as failed.|
|None of the variant alleles match the reference allele||The sequence at the coordinates specified for the variation is extracted from the reference genome and compared to the dbSNP refSNP alleles. If the extracted sequence does not match the expected alleles, the variation is flagged as failed.|
|Checks on the alleles of refSNPs||Loci with no observed variant alleles in dbSNP||Variations with dbSNP refSNP alleles reported as 'NOVARIATION' are flagged as failed.|
|Variation has more than 3 different alleles||Variations with all of A, T, G and C in the dbSNP refSNP alleles are flagged as failed.|
|Alleles contain ambiguity codes||Variations with an IUPAC ambiguity code (e.g. M, Y, R, etc.) in the dbSNP refSNP alleles are reported as failed.|
|Alleles contain non-nucleotide characters||Variations with unexpected characters in the dbSNP refSNP alleles are reported as failed.|
|Checks on the alleles in dbSNP submissions||Additional submitted allele data from dbSNP does not agree with the dbSNP refSNP alleles||Alleles from dbSNP submissions (primarily frequency submissions, but also variant discovery submissions which have been merged in the dbSNP pipeline with the pre-existing refSNP variation) are checked against the dbSNP refSNP alleles and discrepant sets flagged as failed. Often this will highlight a strand error in the submission of frequency information for a known variation. Traditionally, the failure would be flagged at the variation level; in new database builds it is flagged at the allele submission level.|
|External failure classification||Flagged as suspect by dbSNP||Variations reported as being suspect by dbSNP due to being in probable paralogous regions are imported but flagged as failed (human only).|
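The refSNP allele checks in the table above can be sketched as follows. This is a simplified illustration, not the Ensembl QC pipeline. The function name is hypothetical, alleles are assumed to arrive as a '/'-separated string such as 'A/G', and '-' is assumed to be an acceptable placeholder for a deleted allele.

```python
from typing import List

NUCLEOTIDES = set("ACGT")
# IUPAC codes that stand for more than one nucleotide.
IUPAC_AMBIGUITY = set("RYSWKMBDHVN")


def refsnp_allele_failures(allele_string: str) -> List[str]:
    """Return the failure reasons that apply to a refSNP allele string."""
    if allele_string == "NOVARIATION":
        return ["Loci with no observed variant alleles in dbSNP"]

    reasons: List[str] = []
    alleles = allele_string.upper().split("/")

    # All four of A, T, G and C present as alleles.
    if NUCLEOTIDES <= set(alleles):
        reasons.append("Variation has more than 3 different alleles")

    letters = set("".join(alleles))
    if letters & IUPAC_AMBIGUITY:
        reasons.append("Alleles contain ambiguity codes")

    # Anything that is neither a nucleotide, an ambiguity code, nor '-'
    # (the '-' allowance is an assumption of this sketch).
    if letters - NUCLEOTIDES - IUPAC_AMBIGUITY - {"-"}:
        reasons.append("Alleles contain non-nucleotide characters")

    return reasons


for example in ("A/G", "A/C/G/T", "A/M", "NOVARIATION"):
    print(example, refsnp_allele_failures(example))
```

Returning a list of reasons rather than a single flag mirrors the way a variation can fail more than one check at once.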