
Ensembl Variation - Data description

Below is a description of the data we store in the databases for Ensembl Variation.
For several species in Ensembl, we import variant data (SNPs, CNVs, allele frequencies, genotypes, etc.) from a variety of sources (e.g. dbSNP). Imported variants and alleles are subjected to a quality control process to flag suspect data.
We classify the variants into different classes and calculate their predicted consequence(s). We have also created variant sets to help users retrieve a specific group of variants from a particular dataset.
In human, we calculate the linkage disequilibrium for each variant, by population.

See some examples of imported data on the Ensembl website (Human):

Location of a variant in the genome
Population genotypes and frequencies of a variant
Sample genotypes of a variant
Phenotype(s) associated with a variant
Citations

Variant species and data types

The Ensembl Variation database stores data imported from external sources and also data calculated on site.

  • Data imported from external sources (dbSNP, Sanger, DGVa, ...):
    • Variants (SNPs, in-dels, insertion, deletion, ...)
    • Structural variants (copy number variation, tandem duplication, inversion, ...)
    • Probes for copy number variations
    • Locations for variants and structural variants
    • Alleles
    • Populations
    • Genotypes
    • Phenotypes (e.g. Glaucoma in Human, list of phenotype sources)
    • Citations (extracted from dbSNP submissions and text mining performed by EPMC and UCSC)
  • Calculated data: see the Predicted data page.

Ensembl stores variant data for the species listed below. Users can still run the Variant Effect Predictor (VEP) on species for which we do not currently have a variation database.

There are currently 22 variation databases in Ensembl:

Short variant Long variant Genotype Association Prediction
Species Sequence variant (e!88 → e!89) Source(s) Structural variant Sample Population Phenotype Citation SIFT PolyPhen
Cat
Felis catus
3.6 million+ - 1 source - - - - - -
Chicken
Gallus gallus
21 million+ - 1 source - -
Chimpanzee
Pan troglodytes
1.6 million+ - 1 source - - - -
Cow
Bos taurus
100 million+ - 1 source -
Dog
Canis familiaris
5.9 million+ - 1 source -
Fruitfly
Drosophila melanogaster
6.7 million+ - 1 source - - - - -
Gibbon
Nomascus leucogenys
1.1 million+ - 1 source - - - - - -
Horse
Equus caballus
5 million+ - 1 source -
Human
Homo sapiens
157 million+ (+337,000) 6 sources
Macaque
Macaca mulatta
3 million+ - 1 source - - -
Mouse
Mus musculus
80 million+ - 1 source -
Opossum
Monodelphis domestica
1.1 million+ - 1 source - - - - - - -
Orangutan
Pongo abelii
10 million+ - 1 source - - - - - -
Pig
Sus scrofa
60 million+ - 3 sources -
Platypus
Ornithorhynchus anatinus
1.3 million+ - 1 source - - - - -
Rat
Rattus norvegicus
5 million+ - 1 source - -
S. cerevisiae
Saccharomyces cerevisiae
263,000+ - 1 source - - - - -
Sheep
Ovis aries
51 million+ - 2 sources -
Tetraodon
Tetraodon nigroviridis
902,000+ - 1 source - - - - - - -
Turkey
Meleagris gallopavo
9,000+ - 1 source - - - - -
Zebra Finch
Taeniopygia guttata
1.7 million+ - 1 source - - - - -
Zebrafish
Danio rerio
17 million+ - 1 source - -
Colour legend: From 10 million From 1 million to 9.9 million From 1,000 to 999,999 From 1 to 999

The full list of species with their assembly versions in Ensembl is available here.


The majority of variants are imported from NCBI dbSNP. The data is imported when it is released by dbSNP and incorporated into the next Ensembl release. If dbSNP releases the data on a different assembly, Ensembl will remap the variant positions onto the current assembly. Data from projects like the HapMap Project and 1000 Genomes Project is imported once it has been submitted to dbSNP.

Ensembl also includes data from other sources. To view data from these sources in the browser go to a species Location page (e.g. for human), and click on the 'Configure this page' link on the left-hand side. The 'Variation' and 'Somatic mutations' sections contain a track list of all sources of variant data for that species.



Variant displays

Variant data can be viewed in the browser through pages such as:

  • Gene: Variation Table and Variation Image (for all variants in a gene) e.g. for all variants in KCNE2. Structural Variation to see all structural variants overlapping the gene.
  • Transcript: Population comparison, Comparison image (for comparing variants in a transcript across different individual or strain sequences) e.g. compare Tmco4 in different mouse strains
  • Transcript: Sequence, protein: list of the coding variants in protein coordinates.
  • Location: Region in Detail (Variants can be drawn using "Configure this page" at the left. The menu allows display of information in Ensembl databases along with external sources in DAS format such as DGV loci.)
  • Phenotype: A karyotype view to display the variants associated with a certain phenotype, e.g. phenotype "Glaucoma".


Clicking on any variant on an Ensembl page will open a Variation tab with information about the flanking sequence and source for the selected variant. Links to linkage disequilibrium (LD) plots, phenotype information (for human) from EGA, OMIM and NHGRI-EBI and Ensembl genes and transcripts that include the variant can be found at the left of this tab. You may also view multiple genome alignments of various species, highlighting the variant. Ancestral sequences are included in this display.

Variant information can also be accessed using BioMart (gene or variation database), and the Perl API (variation API).



Variant classes

We call the class of a variant according to its component alleles and its mapping to the reference genome, and then display this information on the website. Internally we use Sequence Ontology terms, but we map these to our own 'display' terms where common usage differs from the SO definition (e.g. our term SNP is closer to the SO term SNV). All the classes we call, along with their equivalent SO term are shown in the table below. We also differentiate somatic mutations from germline variants in the display term, prefixing the term with 'somatic'. API users can fetch either the SO term or the display term.


* SO term SO description SO accession Called for (e.g.)
SNV SNVs are single nucleotide positions in genomic DNA at which different sequence alternatives exist. SO:0001483 Variant
genetic_marker A measurable sequence feature that varies within a population. SO:0001645 Variant
substitution A sequence alteration where the length of the change in the variant is the same as that of the reference. SO:1000002 Variant
tandem_repeat Two or more adjacent copies of a region (of length greater than 1). SO:0000705 Variant
Alu_insertion An insertion of sequence from the Alu family of mobile elements. SO:0002063 SV
complex_structural_alteration A structural sequence alteration or rearrangement encompassing one or more genome fragments, with 4 or more breakpoints. SO:0001784 SV
complex_substitution When no simple or well defined DNA mutation event describes the observed DNA change, the keyword "complex" should be used. Usually there are multiple equally plausible explanations for the change. SO:1000005 SV
copy_number_gain A sequence alteration whereby the copy number of a given region is greater than the reference sequence. SO:0001742 SV
copy_number_loss A sequence alteration whereby the copy number of a given region is less than the reference sequence. SO:0001743 SV
copy_number_variation A variation that increases or decreases the copy number of a given region. SO:0001019 SV
duplication An insertion which derives from, or is identical in sequence to, nucleotides present at a known location in the genome. SO:1000035 SV
interchromosomal_breakpoint A rearrangement breakpoint between two different chromosomes. SO:0001873 SV
interchromosomal_translocation A translocation where the regions involved are from different chromosomes. SO:0002060 SV
intrachromosomal_breakpoint A rearrangement breakpoint within the same chromosome. SO:0001874 SV
intrachromosomal_translocation A translocation where the regions involved are from the same chromosome. SO:0002061 SV
inversion A continuous nucleotide sequence is inverted in the same position. SO:1000036 SV
loss_of_heterozygosity A functional variant whereby the sequence alteration causes a loss of function of one allele of a gene. SO:0001786 SV
mobile_element_insertion A kind of insertion where the inserted sequence is a mobile element. SO:0001837 SV
novel_sequence_insertion An insertion the sequence of which cannot be mapped to the reference genome. SO:0001838 SV
short_tandem_repeat_variation A kind of sequence variant whereby a tandem repeat is expanded or contracted with regard to a reference. SO:0002096 SV
tandem_duplication A duplication consisting of 2 identical adjacent regions. SO:1000173 SV
translocation A region of nucleotide sequence that has translocated to a new position. The observed adjacency of two previously separated regions. SO:0000199 SV
deletion The point at which one or more contiguous nucleotides were excised. SO:0000159 Variant, SV
indel A sequence alteration which included an insertion and a deletion, affecting 2 or more bases. SO:1000032 Variant, SV
insertion The sequence of one or more nucleotides added between two adjacent nucleotides in the sequence. SO:0000667 Variant, SV
sequence_alteration A sequence_alteration is a sequence_feature whose extent is the deviation from another sequence. SO:0001059 Variant, SV
probe A DNA sequence used experimentally to detect the presence or absence of a complementary nucleic acid. SO:0000051 CNV probe

* Corresponding colours for the Ensembl web displays (only for Structural variants). The colours were originally based on the dbVar displays.
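
As an illustration of calling a class from a variant's component alleles, here is a minimal Python sketch. The rules, the function name call_variant_class and the "REF/ALT" allele string format (with "-" for an empty allele) are simplifying assumptions made for this example; they are not the Ensembl implementation, which also takes the mapping to the reference genome into account. The only display mapping shown is SNV to SNP, as described above; the 'somatic' prefix is not handled.

```python
# Simplified, assumed rules for calling a class from an allele string.
BASES = set("ACGT")

# Display terms that differ from the SO term (only the one named in the text).
DISPLAY_TERMS = {"SNV": "SNP"}


def _is_sequence(allele: str) -> bool:
    """True if the allele is a non-empty string of A, C, G, T."""
    return len(allele) > 0 and set(allele) <= BASES


def call_variant_class(allele_string: str) -> str:
    """Return an SO term for an allele string such as 'A/G' or '-/C'."""
    alleles = allele_string.upper().split("/")
    if len(alleles) < 2:
        return "sequence_alteration"
    ref, alts = alleles[0], alleles[1:]
    # Anything we cannot interpret falls back to the most general term.
    if not all(a == "-" or _is_sequence(a) for a in alleles):
        return "sequence_alteration"
    if ref == "-":
        if all(_is_sequence(a) for a in alts):
            return "insertion"
        return "sequence_alteration"
    if all(a == "-" for a in alts):
        return "deletion"
    if any(a == "-" for a in alts):
        return "sequence_alteration"
    if all(len(a) == len(ref) for a in alts):
        return "SNV" if len(ref) == 1 else "substitution"
    return "indel"


def display_term(so_term: str) -> str:
    """Map an SO term to the display term where the two differ."""
    return DISPLAY_TERMS.get(so_term, so_term)
```

For example, call_variant_class("A/G") gives the SO term SNV, which display_term maps to SNP.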

Human variant class distribution - Ensembl 89



Insertion and Deletion coordinates

In Ensembl, an insertion is indicated by start coordinate = end coordinate + 1. For example, an insertion of 'C' between nucleotides 12600 and 12601 on the forward strand is indicated with start and end coordinates as follows:

   start: 12601     end: 12600

A deletion is indicated by the exact nucleotide coordinates. For example, a three base pair deletion of nucleotides 12600, 12601, and 12602 of the reverse strand will have start and end coordinates of:

   start: 12600     end: 12602
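
The coordinate convention above can be sketched in a few lines of Python. The function names are hypothetical and exist only for this illustration; the arithmetic follows directly from the two rules stated above (insertion: start = end + 1; deletion: exact coordinates of the deleted bases).

```python
from typing import Tuple


def insertion_coords(left_base: int) -> Tuple[int, int]:
    """(start, end) for an insertion between left_base and left_base + 1."""
    return (left_base + 1, left_base)


def deletion_coords(first_deleted: int, length: int) -> Tuple[int, int]:
    """(start, end) for a deletion of `length` bases from first_deleted."""
    return (first_deleted, first_deleted + length - 1)


def is_insertion(start: int, end: int) -> bool:
    """An insertion is recognisable because start is one more than end."""
    return start == end + 1


def reference_length(start: int, end: int) -> int:
    """Number of reference bases covered; 0 for an insertion."""
    return max(end - start + 1, 0)
```

With the examples from the text, insertion_coords(12600) returns (12601, 12600) and deletion_coords(12600, 3) returns (12600, 12602).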


Minor Alleles

Minor alleles and their frequencies are available for variants discovered in the 1000 Genomes Project.
These are calculated by dbSNP. If there are more than two alleles, the second most common is reported (See example: rs200077393).
This allows common variants to be distinguished from rare variants in situations where deep sequencing has identified a third rare allele.
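
A minimal Python sketch of the "second most common allele" rule follows. The function name and the frequencies used in the example are invented for illustration; the actual values are calculated by dbSNP as stated above.

```python
from typing import Dict, Optional, Tuple


def minor_allele(freqs: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return (allele, frequency) for the second most common allele.

    Returns None when fewer than two alleles are given.
    """
    if len(freqs) < 2:
        return None
    # Rank alleles from most to least frequent and take the second one.
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    return ranked[1]
```

With three alleles at made-up frequencies A=0.90, G=0.08 and T=0.02, the reported minor allele is G at 0.08 rather than the rare third allele T, so the variant still appears common.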



Populations

For each variant we store population data (allele frequencies, genotypes) available from different projects (e.g. 1000 Genomes, HapMap). You can see this on our variation pages, for example: rs699. Frequencies displayed (to 3 decimal places) may not add up to 1 due to rounding.
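
The rounding effect can be reproduced with a small Python sketch. The three equal allele frequencies are a made-up example chosen to make the effect visible; they are not taken from any Ensembl variant.

```python
from typing import Dict


def displayed_frequencies(freqs: Dict[str, float], places: int = 3) -> Dict[str, float]:
    """Round each allele frequency as it would be shown on a page."""
    return {allele: round(freq, places) for allele, freq in freqs.items()}


# Hypothetical variant with three equally frequent alleles.
example = {"A": 1 / 3, "C": 1 / 3, "G": 1 / 3}
shown = displayed_frequencies(example)
total = sum(shown.values())  # approximately 0.999, not 1
```
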
Below we list the population names associated with the main genotyping projects available in Ensembl Variation:

Name Size Description
1000GENOMES:phase_3:ALL 2504 All phase 3 individuals
1000GENOMES:phase_3:AFR 661 African
  • 1000GENOMES:phase_3:ACB
96 African Caribbean in Barbados
  • 1000GENOMES:phase_3:ASW
61 African Ancestry in Southwest US
  • 1000GENOMES:phase_3:ESN
99 Esan in Nigeria
  • 1000GENOMES:phase_3:GWD
113 Gambian in Western Division, The Gambia
  • 1000GENOMES:phase_3:LWK
99 Luhya in Webuye, Kenya
  • 1000GENOMES:phase_3:MSL
85 Mende in Sierra Leone
  • 1000GENOMES:phase_3:YRI
108 Yoruba in Ibadan, Nigeria
1000GENOMES:phase_3:AMR 347 American
  • 1000GENOMES:phase_3:CLM
94 Colombian in Medellin, Colombia
  • 1000GENOMES:phase_3:MXL
64 Mexican Ancestry in Los Angeles, California
  • 1000GENOMES:phase_3:PEL
85 Peruvian in Lima, Peru
  • 1000GENOMES:phase_3:PUR
104 Puerto Rican in Puerto Rico
1000GENOMES:phase_3:EAS 504 East Asian
  • 1000GENOMES:phase_3:CDX
93 Chinese Dai in Xishuangbanna, China
  • 1000GENOMES:phase_3:CHB
103 Han Chinese in Beijing, China
  • 1000GENOMES:phase_3:CHS
105 Southern Han Chinese, China
  • 1000GENOMES:phase_3:JPT
104 Japanese in Tokyo, Japan
  • 1000GENOMES:phase_3:KHV
99 Kinh in Ho Chi Minh City, Vietnam
1000GENOMES:phase_3:EUR 503 European
  • 1000GENOMES:phase_3:CEU
99 Utah residents with Northern and Western European ancestry
  • 1000GENOMES:phase_3:FIN
99 Finnish in Finland
  • 1000GENOMES:phase_3:GBR
91 British in England and Scotland
  • 1000GENOMES:phase_3:IBS
107 Iberian populations in Spain
  • 1000GENOMES:phase_3:TSI
107 Toscani in Italy
1000GENOMES:phase_3:SAS 489 South Asian
  • 1000GENOMES:phase_3:BEB
86 Bengali in Bangladesh
  • 1000GENOMES:phase_3:GIH
103 Gujarati Indian in Houston, TX
  • 1000GENOMES:phase_3:ITU
102 Indian Telugu in the UK
  • 1000GENOMES:phase_3:PJL
96 Punjabi in Lahore, Pakistan
  • 1000GENOMES:phase_3:STU
102 Sri Lankan Tamil in the UK

Variants that have been discovered in this project have the "evidence status" 1000Genomes, which is shown on the website by the corresponding icon.

Name Size Description
ExAC:ALL - All ExAC individuals
ExAC:AFR - African/African American
ExAC:AMR - Latino
ExAC:Adj - Adjusted (individuals with GQ >= 20 and depth DP >= 10)
ExAC:EAS - East Asian
ExAC:FIN - Finnish
ExAC:NFE - Non-Finnish European
ExAC:OTH - Other
ExAC:SAS - South Asian

Variants that have been discovered in this project have the "evidence status" ExAC, which is shown on the website by the corresponding icon.

Name Size Description
CSHL-HAPMAP:HAPMAP-ASW 90 African ancestry in Southwest USA. ASW is one of the 11 populations in HapMap phase 3.
CSHL-HAPMAP:HAPMAP-CHB 90 Han Chinese in Beijing, China. CHB is one of the 11 populations in HapMap phase 3.
CSHL-HAPMAP:HAPMAP-CHD 100 Chinese in Metropolitan Denver, Colorado. CHD is one of the 11 populations in HapMap phase 3.
CSHL-HAPMAP:HAPMAP-GIH 100 Gujarati Indians in Houston, Texas. GIH is one of the 11 populations in HapMap phase 3.
CSHL-HAPMAP:HAPMAP-LWK 100 Luhya in Webuye, Kenya. LWK is one of the 11 populations in HapMap phase 3.
CSHL-HAPMAP:HAPMAP-MEX 90 Mexican ancestry in Los Angeles, California. MEX is one of the 11 populations in HapMap phase 3.
CSHL-HAPMAP:HAPMAP-MKK 180 Maasai in Kinyawa, Kenya. MKK is one of the 11 populations in HapMap phase 3.
CSHL-HAPMAP:HAPMAP-TSI 100 Toscans in Italy. TSI is one of the 11 populations in HapMap phase 3.
CSHL-HAPMAP:HapMap-CEU 185 Utah residents with Northern and Western European ancestry from the CEPH collection. CEU is one of the 11 populations in HapMap phase 3.
CSHL-HAPMAP:HapMap-HCB 48 45 unrelated Han Chinese in Beijing, China, representing one of the populations studied in the International HapMap project (http://www.hapmap.org). See http://www.hapmap.org/citinghapmap.html.en for further information about this population and others studied in the project. http://www.hapmap.org/hapmappopulations.html.en also has relevant information.
CSHL-HAPMAP:HapMap-JPT 93 Japanese in Tokyo, Japan. JPT is one of the 11 populations in HapMap phase 3.
CSHL-HAPMAP:HapMap-YRI 185 Yoruba in Ibadan, Nigeria. YRI is one of the 11 populations in HapMap phase 3.

Variants that have been discovered in this project have the "evidence status" HapMap, which is shown on the website by the corresponding icon.

Name Size Description
ESP6500:African_American - -
ESP6500:European_American - -

Variants that have been discovered in this project have the "evidence status" ESP, which is shown on the website by the corresponding icon.

Name Size Description
Mouse Genomes Project 18 18 mouse strains whole-genome sequenced by the Mouse Genomes Project
Name Size Description
NextGen:IROA 20 Iranian Ovis aries (sheep) from the NextGen Project
NextGen:MOOA 160 Moroccan Ovis aries (sheep) from the NextGen Project
Name Size Description
NextGen:IRBT 8 Iranian Bos taurus (cow) from the NextGen Project


Variant sets

We use variant sets to group variants that share a property. For example, the variants identified in the 1000 Genomes phase 3 study form the set '1000 Genomes 3 - All', which is divided by population into subsets such as '1000 Genomes 3 - AFR', '1000 Genomes 3 - AMR', '1000 Genomes 3 - EAS', '1000 Genomes 3 - EUR' and '1000 Genomes 3 - SAS'. Variant sets can be displayed as separate tracks on the location view. This behaviour is controlled from the 'Variation' section of the configuration panel, which is accessed by clicking the 'Configure this page' link in the left-hand navigation.

The sets are constructed during production and are stored in the database. The table below lists the available variant sets in the Ensembl variation database (subsets are indicated by bullet points).

Variant sets common to all species

Name Short name Description
All failed variations fail_all Variations that have failed the Ensembl QC checks

Variant sets specific to Human

Name Short name Description
1000 Genomes 3 - All 1kg_3 Variants genotyped by the 1000 Genomes project (phase 3)
  • 1000 Genomes 3 - AFR
1kg_3_afr Variants genotyped in African individuals by the 1000 Genomes project (phase 3)
  • 1000 Genomes 3 - AFR - common
1kg_3_afr_com Variants genotyped in African individuals by the 1000 Genomes project (phase 3) with frequency of at least 1%
  • 1000 Genomes 3 - AMR
1kg_3_amr Variants genotyped in admixed American individuals by the 1000 Genomes project (phase 3)
  • 1000 Genomes 3 - AMR - common
1kg_3_amr_com Variants genotyped in admixed American individuals by the 1000 Genomes project (phase 3) with frequency of at least 1%
  • 1000 Genomes 3 - All - common
1kg_3_com Variants genotyped by the 1000 Genomes project (phase 3) with frequency of at least 1%
  • 1000 Genomes 3 - EAS
1kg_3_eas Variants genotyped in East Asian individuals by the 1000 Genomes project (phase 3)
  • 1000 Genomes 3 - EAS - common
1kg_3_eas_com Variants genotyped in East Asian individuals by the 1000 Genomes project (phase 3) with frequency of at least 1%
  • 1000 Genomes 3 - EUR
1kg_3_eur Variants genotyped in European individuals by the 1000 Genomes project (phase 3)
  • 1000 Genomes 3 - EUR - common
1kg_3_eur_com Variants genotyped in European individuals by the 1000 Genomes project (phase 3) with frequency of at least 1%
  • 1000 Genomes 3 - SAS
1kg_3_sas Variants genotyped in South Asian individuals by the 1000 Genomes project (phase 3)
  • 1000 Genomes 3 - SAS - common
1kg_3_sas_com Variants genotyped in South Asian individuals by the 1000 Genomes project (phase 3) with frequency of at least 1%
All HapMap hapmap Variants which have been assayed by The International HapMap Project [http://hapmap.ncbi.nlm.nih.gov/]
  • HapMap - CEU
hapmap_ceu Variants which have been assayed by The International HapMap Project from CEU individuals
  • HapMap - HCB
hapmap_hcb Variants which have been assayed by The International HapMap Project from HCB individuals
  • HapMap - JPT
hapmap_jpt Variants which have been assayed by The International HapMap Project from JPT individuals
  • HapMap - YRI
hapmap_yri Variants which have been assayed by The International HapMap Project from YRI individuals
All LSDB-associated variants lsdb_variants Variants associated with one or more Locus Specific DataBases (LSDBs)
  • HbVar
HbVar Variants for the Human Hemoglobin Variants and Thalassemias database
  • Infevers
Infevers Variants from the registry of Hereditary Auto-inflammatory Disorders Mutations
  • KAT6BDB
KAT6BDB Variants from the K(lysine) acetyltransferase 6B database, BCM
  • LMDD
LMDD Variants from the Leiden Muscular Dystrophy Database
  • LSDB
LSDB Variants dbSNP annotates as being from LSDBs
  • OIVD
OIVD Variants from the Osteogenesis Imperfecta Variant Database
  • PAHdb
PAHdb Variants from the Phenylalanine hydroxylase database
  • dbPEX
dbPEX Variants from the PEX Gene Database
All phenotype/disease-associated variants ph_variants Variants that have been associated with a phenotype or a disease
  • All ClinVar
ClinVar Variants with ClinVar annotation
  • COSMIC phenotype variants
ph_cosmic Phenotype annotations of somatic mutations found in human cancers from the COSMIC project
  • Clinically associated variants
clin_assoc Variants described by ClinVar as being probable-pathogenic, pathogenic, drug-response or histocompatibility
  • HGMD-PUBLIC variants
ph_hgmd_pub Variants annotated by HGMD
  • NHGRI-EBI catalog phenotype variants
ph_nhgri Variants associated with phenotype data from the NHGRI-EBI GWAS catalog [http://www.ebi.ac.uk/gwas/]
  • OMIM phenotype variants
ph_omim Variations linked to entries in the Online Mendelian Inheritance in Man (OMIM) database
  • PhenCode
phencode Variants from the PhenCode Project
  • Uniprot phenotype variants
ph_uniprot Variations with phenotype annotations provided by Uniprot
ESP_6500 esp_6500 Variants from the NHLBI Exome Sequencing Project (investigating heart, lung and blood disorders)
ExAC exac Variants identified by the Exome Aggregation Consortium (ExAC) - release 0.3
Genotyping chip variants all_chips Variants which have assays on commercial chips held in Ensembl
  • Affy GeneChip 500K
Affy_500K Variants from the Affymetrix GeneChip Human Mapping 500K Array Set
  • Affy GenomeWideSNP_6.0
Affy_SNP6 Variants from the Affymetrix Genome-Wide Human SNP Array 6.0
  • HumanOmniExpress
HumanOmniExpress Variants from the Illumina HumanOmniExpress 12v1-1_a whole genome genotyping array
  • Illumina_Cardio-Metabo_Chip
Cardio-Metabo_Chip Variants from the Illumina Cardio-Metabo_Chip genotyping array designed to target variants of interest for metabolic and cardiovascular disease traits
  • Illumina_CytoSNP12v1
Illumina_CytoSNP12v1 Variants from the Illumina Cyto SNP-12 v1 whole genome SNP genotyping chip designed for cytogenetic analysis
  • Illumina_ExomeChip
ExomeChip Variants from the Illumina ExomeChip genotyping array designed to target variants within exons
  • Illumina_Human1M-duo
Illumina_1M-duo Variants from the Illumina Human1M-duo v3 whole genome genotyping array designed for association studies
  • Illumina_Human610_Quad
Human610_Quad Variants from the Illumina Human610_Quad v1_B whole genome genotyping array designed for association studies
  • Illumina_Human660W-quad
Illumina_660Q Variants from the Illumina Human660W-quad whole genome genotyping array designed for association studies
  • Illumina_HumanHap550
HumanHap550 Variants from the Illumina Human550 v3.0 whole genome genotyping array designed for association studies
  • Illumina_HumanHap650Y
HumanHap650Y Variants from the Illumina HumanHap650Y v3.0 whole genome genotyping array designed for association studies
  • Illumina_HumanOmni1-Quad
HumanOmni1-Quad Variants from the Illumina HumanOmni1-Quad whole genome genotyping array designed for association studies
  • Illumina_HumanOmni2.5
HumanOmni2.5 Variants from the Illumina HumanOmni2.5 4v1 whole genome genotyping array designed for association studies
  • Illumina_HumanOmni5
HumanOmni5 Variants from the Illumina HumanOmni5v1 whole genome genotyping array designed for association studies
  • Illumina_ImmunoChip
ImmunoChip Variants from the Illumina ImmunoChip genotyping array designed to target variants of interest for autoimmune and inflammatory diseases
HumanCoreExome-12 HumanCoreExome Variants from the Illumina HumanCoreExome-12 v1 genotyping chip.
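
To illustrate how the 1000 Genomes set short names and the 1% "common" threshold in the table above relate to each other, here is a minimal Python sketch. The function names and the input format (a mapping from population code to frequency, where the presence of a key means the variant was genotyped in that population) are assumptions made for this example. The sets themselves are built during Ensembl production, and that process is not described here.

```python
from typing import Dict, List

# "Common" sets in the table require a frequency of at least 1%.
COMMON_THRESHOLD = 0.01


def short_name(population: str, common: bool = False) -> str:
    """Build a set short name such as '1kg_3_afr' or '1kg_3_afr_com'."""
    base = "1kg_3" if population == "ALL" else f"1kg_3_{population.lower()}"
    return base + "_com" if common else base


def sets_for_variant(pop_freqs: Dict[str, float]) -> List[str]:
    """List the set short names a variant would fall into.

    pop_freqs maps a population code (ALL, AFR, AMR, EAS, EUR, SAS)
    to the variant's frequency in that population.
    """
    names: List[str] = []
    for population, freq in pop_freqs.items():
        names.append(short_name(population))
        if freq >= COMMON_THRESHOLD:
            names.append(short_name(population, common=True))
    return names
```

For a hypothetical variant with frequency 0.004 overall and 0.015 in AFR, this yields 1kg_3, 1kg_3_afr and 1kg_3_afr_com, but not 1kg_3_com.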


Clinical significance

The clinical significance terms found in the human Ensembl Variation database are listed below:

Value                   ClinVar example  DGVa example
association             rs326            -
benign                  rs328            -
confers sensitivity     rs1799853        -
drug response           rs4680           -
likely benign           rs248            -
likely pathogenic       rs25388          esv1791726
not provided            rs661            -
other                   rs334            -
pathogenic              rs268            esv2830397
protective              rs333            -
risk factor             rs333            -
uncertain significance  rs914            esv2830426

Further explanations about the clinical significance terms are available on the ClinVar website.

ClinVar rating

We use the ClinVar "four-star" rating system to indicate the quality of classification/validation of the variant:

Rating        Description                        Example
0 of 4 stars  not classified by submitter        rs397516070
1 of 4 stars  classified by single submitter     rs45517277
2 of 4 stars  classified by multiple submitters  rs118203576
3 of 4 stars  reviewed by expert panel           rs578776
4 of 4 stars  practice guideline                 rs121908745


Evidence status

We provide a simple summary of the evidence supporting a variant as a guide to its potential reliability.

Icon Name Description
Multiple observations The variant has multiple independent dbSNP submissions, i.e. submissions with different submitter handles or different discovery samples
Frequency The variant is reported to be polymorphic in at least one sample
Cited The variant is cited in a PubMed article.
Phenotype or Disease The variant is associated with at least one phenotype or disease.
1000 Genomes The variant was discovered in the 1000 Genomes Project (human only)
ExAC The variant was discovered in the Exome Aggregation Consortium (human only).
HapMap The variant is polymorphic in at least one HapMap panel (human only)
ESP The variant was discovered in the Exome Sequencing Project (human only).
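
The "Multiple observations" criterion above can be expressed as a short Python sketch. The Submission record and its field names are hypothetical and chosen for this example only; they do not reflect the dbSNP or Ensembl data model.

```python
from typing import List, NamedTuple


class Submission(NamedTuple):
    """Hypothetical record of one dbSNP submission for a variant."""
    handle: str  # submitter handle
    sample: str  # discovery sample


def has_multiple_observations(submissions: List[Submission]) -> bool:
    """True if submissions differ in submitter handle or discovery sample."""
    if len(submissions) < 2:
        return False
    handles = {s.handle for s in submissions}
    samples = {s.sample for s in submissions}
    return len(handles) > 1 or len(samples) > 1
```

Two submissions from the same handle on the same sample would not count as independent under this reading.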


Quality control

A quality control process is employed to check imported variant data. Suspect variants and alleles are flagged, but are not withheld from downstream annotation. Data failing the checks is available through the browser, where failure reasons are prominently listed. The API does not extract failed data by default, unless the database adaptor is specifically configured to do so using Bio::EnsEMBL::Variation::DBSQL::DBAdaptor::include_failed_variations().

Variants for which dbSNP holds citations from PubMed are not submitted to the QC process so are not flagged as failed.

Failure reasons

QC Type Reported failure reason Checking process
Mapping checks Variant does not map to the genome Variants with flanking sequences which do not map to reference or non-reference genomic sequences are flagged as failed.
Variant maps to more than 1 location For variants with flanking sequences mapping to a reference sequence, the number of mappings within all reference sequences is counted and those mapping more than once are flagged as failed. (Variants with a single mapping to both X and Y within a PAR region are not failed.) For variants with flanking sequences which do not map to a reference sequence, the number of mappings within all non-reference sequences is counted and those mapping more than once are flagged as failed.
Mapped position is not compatible with reported alleles The length of the reported alleles is compared to that expected given the coordinates specified for the variant. If none of the alleles match the expected length, the variant is flagged as failed.
None of the variant alleles match the reference allele The sequence at the coordinates specified for the variant is extracted from the reference genome and compared to the dbSNP refSNP alleles. If the extracted sequence does not match the expected alleles, the variant is flagged as failed.
Checks on the alleles of refSNPs Loci with no observed variant alleles in dbSNP Variants with dbSNP refSNP alleles reported as 'NOVARIATION' are flagged as failed.
Alleles contain ambiguity codes Variants with an IUPAC ambiguity code (e.g. M, Y, R, etc.) in the dbSNP refSNP alleles are reported as failed.
Alleles contain non-nucleotide characters Variants with unexpected characters in the dbSNP refSNP alleles are reported as failed.
Checks on the alleles in dbSNP submissions Additional submitted allele data from dbSNP does not agree with the dbSNP refSNP alleles Alleles from all the dbSNP submissions for the rsID are checked against the dbSNP refSNP alleles. (These alleles are primarily frequency submissions but can also be from variant discovery submissions, and these are merged in the dbSNP pipeline with the pre-existing refSNP variant.) Discrepant sets of alleles are flagged as failed, as this will often highlight a strand error in the submission of frequency information for a known variant. The failure is flagged at the allele submission level.
External failure classification Flagged as suspect by dbSNP Variants reported as being suspect by dbSNP due to being in probable paralogous regions are imported but flagged as failed (human only).
New assembly Variant can not be re-mapped to the current assembly Variants that mapped to the previous assembly, but couldn't be remapped to the current assembly are flagged as failed.
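
Some of the allele-level checks in the table above can be sketched in Python as follows. This is an illustrative approximation and not the Ensembl QC pipeline: the function name, the input format ("-" for an empty allele) and the exact definition of an "unexpected" character are assumptions. Mapping-based checks are not covered.

```python
from typing import List

NUCLEOTIDES = set("ACGT")
IUPAC_AMBIGUITY = set("RYSWKMBDHVN")


def qc_failures(alleles: List[str], start: int, end: int,
                reference_seq: str) -> List[str]:
    """Return the failure reasons (as worded in the table) that apply."""
    reasons: List[str] = []
    if "NOVARIATION" in alleles:
        reasons.append("Loci with no observed variant alleles in dbSNP")
    real = [a.upper() for a in alleles if a != "NOVARIATION"]

    # Character-level checks; "-" denotes an empty allele and is allowed.
    chars = set("".join(real)) - {"-"}
    if chars & IUPAC_AMBIGUITY:
        reasons.append("Alleles contain ambiguity codes")
    if chars - NUCLEOTIDES - IUPAC_AMBIGUITY:
        reasons.append("Alleles contain non-nucleotide characters")

    if real:
        # Expected allele length from the coordinates (0 for an insertion).
        expected_len = max(end - start + 1, 0)
        seqs = ["" if a == "-" else a for a in real]
        if expected_len not in [len(s) for s in seqs]:
            reasons.append(
                "Mapped position is not compatible with reported alleles")
        elif reference_seq.upper() not in seqs:
            reasons.append(
                "None of the variant alleles match the reference allele")
    return reasons
```

An empty list means none of these particular checks failed; a variant could still be flagged by the mapping or dbSNP-level checks.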


Phenotype/disease ontologies

We import ontology terms related to human phenotypes, traits and diseases from a variety of sources using an automated process. Ontologies used are:

Descriptions are linked to ontology terms using:

  • Mappings provided by association data sources such as Orphanet, the NHGRI-EBI GWAS catalog and ClinVar
  • Annotations of OMIM terms created by HPO
  • Annotations of OMIM terms created by Orphanet
  • Ontology LookUp Service searches of full or truncated descriptions for exact matches to terms or synonyms
  • Zooma searches of annotations curated by the European Variation Archive team

References

  • Sebastian Köhler, Sandra C Doelken, Christopher J. Mungall, Sebastian Bauer, Helen V. Firth, Isabelle Bailleul-Forestier, Graeme C. M. Black, Danielle L. Brown, Michael Brudno, Jennifer Campbell, David R. FitzPatrick, Janan T. Eppig, Andrew P. Jackson, Kathleen Freson, Marta Girdea, Ingo Helbig, Jane A. Hurst, Johanna Jähn, Laird G. Jackson, Anne M. Kelly, David H. Ledbetter, Sahar Mansour, Christa L. Martin, Celia Moss, Andrew Mumford, Willem H. Ouwehand, Soo-Mi Park, Erin Rooney Riggs, Richard H. Scott, Sanjay Sisodiya, Steven Van Vooren, Ronald J. Wapner, Andrew O. M. Wilkie, Caroline F. Wright, Anneke T. Vulto-van Silfhout, Nicole de Leeuw, Bert B. A. de Vries, Nicole L. Washington, Cynthia L. Smith, Monte Westerfield, Paul Schofield, Barbara J. Ruef, Georgios V. Gkoutos, Melissa Haendel, Damian Smedley, Suzanna E. Lewis, and Peter N. Robinson
    The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data
    Nucl. Acids Res. (1 January 2014) 42 (D1): D966-D974
    doi:10.1093/nar/gkt1026

  • Malone J, Holloway E, Adamusiak T, Kapushesky M, Zheng J, Kolesnikov N, Zhukova A, Brazma A, Parkinson H.
    Modeling sample variables with an Experimental Factor Ontology
    Bioinformatics (2010) 26 (8): 1112-1118
    doi:10.1093/bioinformatics