TermCategoryDescriptionRead more
1-to-1 orthologuesOrthologuesA type of orthologue assigned for a pair of species where only one copy is found in each species.
1-to-many orthologuesOrthologuesA type of orthologue assigned for a pair of species where one gene in one species is orthologous to multiple genes in the other species, due to (a) duplication event(s) in the second species.
1000 Genomes projectVariation source databaseThe goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the human populations studied. Ensembl display sample genotypes and population frequencies from the 1000 Genomes project. http://www.internationalgenome.org/
3 prime UTR variantVariant consequenceA UTR variant of the 3' UTR
3' incompleteTranscriptA protein-coding transcript which is missing the stop codon due to incomplete evidence.
3' overlapping ncRNALong non-coding RNA (lncRNA)Transcripts where ditag and/or published experimental data strongly supports the existence of long (>200bp) non-coding transcripts that overlap the 3'UTR of a protein-coding locus on the same strand.
3' UTRTranscriptThe region of a coding cDNA downstream of the stop codon which is not translated.
5 prime UTR variantVariant consequenceA UTR variant of the 5' UTR
5' incompleteTranscriptA protein-coding transcript which is missing the start codon due to incomplete evidence.
5' UTRTranscriptThe region of a coding cDNA upstream of the start codon which is not translated.
ActiveRegulatory activityWhen a regulatory feature displays an epigenetic signature which is consistent with it carrying out its named function, for example an active Promoter has an epigenetic signature consistent with initiating transcription, while an active CTCF binding site will bind CTCF. It is analogous to a sprinter running.
AGPFile formatsA golden path. A file provided to Ensembl that describes how the longer sequences in the genome assembly were assembled from shorter sequences. For example, an AGP file can describe how a chromosome is assembled from a collection of scaffolds or a collection of contigs. For an AGP file that describes how a scaffold is assembled from a collection of contigs, each contig will be listed on a separate line in the AGP file and the line will include information about where the contig lies within the scaffold and the orientation of the contig.
AlgorithmA sequence of computational tasks or actions that carry out a specific function.
AlignmentsGenome annotationA comparison between two or more sequences by matching identical and/or similar residues/nucleotides and assigning a score to the match.
Allele (gene)GeneDifferent versions of a gene found between the primary assembly and a patch or genome haplotype.
Allele (variant)VariantOne of a number of alternative forms of the same genetic locus/variant.
Alternative alleleAllele (variant)Any allele of a variant which is not in the reference genome currently being studied. The alternative allele is not necessarily the minor allele.
Alternative sequenceGenome assemblyGenomic sequence that differs from the genomic DNA on the primary assembly. These are represented as sequence on top of the primary assembly. Provided by the GRC for human and mouse.
Alu insertionRepeatA dispersed intermediately repetitive DNA sequence found in the human genome in about one million copies. The sequence is about 300 bp long and is found commonly in introns, 3' untranslated regions of genes, and intergenic genomic regions. The name Alu comes from the recognition site for the AluI endonuclease that cleaves it.
Ambiguity codeVariantA single letter code that represents two or more possible nucleotides at a single base locus.
Ancestral alleleAllele (variant)The allele which occurs at this locus in closely related species and is thought to reflect the allele present at the time of speciation. The ancestral allele may be the reference or the alternative allele, and the major or minor allele.
Animal QTLdbPhenotype source databaseProject aiming to house all publicly available QTL and association data on livestock animal species. Ensembl display phenotypes from the Animals QTLdb. https://www.animalgenome.org/cgi-bin/QTLdb/index
AntisenseLong non-coding RNA (lncRNA)Transcripts that overlap the genomic span (i.e. exon or introns) of a protein-coding locus on the opposite strand.
APPRISTranscriptAPPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods to identify the most functionally important transcript(s) of a gene.
APPRIS ALT1APPRISFor genes in which the APPRIS core modules are unable to choose a clear principal isoform, ALT1 is the candidate transcript model(s) conserved in at least three tested species.
APPRIS ALT2APPRISFor genes in which the APPRIS core modules are unable to choose a clear principal isoform, ALT2 is the candidate transcript model(s) that appear to be conserved in fewer than three tested species.
APPRIS P1APPRISTranscript(s) expected to code for the main functional isoform based solely on the core modules in the APPRIS.
APPRIS P2APPRISWhere the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant.
APPRIS P3APPRISWhere the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants has a distinct CCDS identifier, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated.
APPRIS P4APPRISWhere the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.
APPRIS P5APPRISWhere the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.
BACCloneA vector used to clone DNA fragments (100 to 300-kb insert size; average, 150 kb) from another species so that it can be replicated in bacteria. Many genomes (such as human) were sequenced by cloning segments into BACs, amplifying and sequencing the clones.
BAM/CRAMFile formatsBAM and CRAM store alignments of NGS data to the genome. Ensembl allow attachment of BAM and CRAM files to view against the genome, and store RNA-seq, ChIP-seq and DNase-seq in BAM.
Base pairs (genome size)Genome assemblyThe actual number of bases of sequence we have for a full genome assembly, including alternative sequences and PARs, excluding gaps.
BEDFile formatsBED is a simple format for listing genomic loci. It can be used to upload data to view in Ensembl, as a custom file for additional VEP annotation and is used to store and download constrained elements in Ensembl.
BedGraphFile formatsBedGraph allows you to store scores for loci in BED format, the loci can be of varying size. It can be uploaded to view in Ensembl.
Between species paraloguesParaloguesMembers of the same gene family in different species that are not direct orthologues. In a gene tree, these genes are separated by a duplication node.
BigBedFile formatsBigBed is an indexed form of BED, which can be used to store larger scale data. Ensembl allow attachment of BigBed files to view against the genome and store peaks of regulatory evidence as BigBed.
BigWigFile formatsBigWig is an indexed form of wiggle and can be used to store larger scale data. Ensembl simplify NGS data, such as ChIP-seq and RNA-seq into BigWig to view in the browser. It can also be used to attach your own data to Ensembl.
BiotypeGeneA gene or transcript classification.
Bisulfite sequencingEpigenome evidenceA method to determine the methylation of genomic cytosines.
BLASTAlgorithmA sequence comparison algorithm optimised for speed which is used to search sequence databases for optimal local alignments to a query.
BlastZPairwise whole genome alignmentBlastZ is a program for aligning DNA sequences in a pairwise manner. It has been replaced by LASTZ.
BLATAlgorithmAn mRNA/DNA and cross-species protein sequence analysis tool to quickly find sequences of 95% and greater similarity of length 40 bases or more.
BLOSUM 62AlgorithmA matrix that defines scores for amino acid substitutions, reflecting the similarity of physicochemical properties, and observed substitution frequencies. The BLOSUM 62 matrix is tailored using sequences sharing no more than 62% identity (sequences that are evolutionarily closer were represented by a single sequence in the alignment, to avoid bias from closely related family members).
Blueprint EpigenomesEpigenome source databaseProject aiming to apply functional genomics analysis on primary cells of the haematopoietic cell lineage from healthy and diseased individuals, to produce lineage-specific epigenomes. Used as a source for the Ensembl regulatory build. http://www.blueprint-epigenome.eu/
CADDAlgorithmA tool that integrates multiple annotations into one metric for scoring the deleteriousness of single nucleotide variants.
CCDSTranscriptA coding sequence in the Consensus Coding Sequence Set is consistently annotated between Ensembl, MGI, HGNC and NCBI. The long term goal is to support convergence towards a standard set of gene annotations on the human genome.
cDNATranscriptThe sequence of the spliced exons of a transcript expressed in DNA notation (T rather than U), representing the coding or sense strand. The cDNA contains the whole sequence of the RNA, including coding and untranslated sequence.
CDSTranscriptCoDing Sequence. The region of a cDNA which is translated. In Ensembl displays, the stop codon is included as part of the CDS sequence.
CentromereRepeatThe region of the chromosome at which the two sister chromatids are joined during mitosis and meiosis, mostly composed of satellite DNA.
chainFile formatsChain files describe the mapping between different genome assemblies. Ensembl store these on the FTP site.
ChIP-seqEpigenome evidenceA method to determine the genomic regions that proteins bind to.
CIGARAlignmentsThe cigar line defines the sequence of matches/mismatches and deletions (or gaps) in an alignment
Clinical significanceVariantA classification of a variant's impact on disease, taken from ClinVar.
ClinVarVariation source databaseNCBI resource that aggregates information about genomic variation and its relationship to human health. Ensembl display clinical significance and phenotypes from ClinVar. https://www.ncbi.nlm.nih.gov/clinvar/
CloneGenome assemblyA segment of DNA that has been inserted into a vector molecule, such as a plasmid, and then replicated to form many identical copies.
CNVStructural variantCopy Number Variation: increases or decreases the copy number of a given locus. Subcategorised into Loss and Gain compared to the reference.
Coding sequence variantVariant consequenceA sequence variant that changes the coding sequence
CodonCDSThree base pairs in either DNA or RNA that code for an amino acid (or stop translation).
Complex structural alterationStructural variantA structural sequence alteration or rearrangement encompassing one or more genome fragments, with four or more breakpoints.
Complex substitutionStructural variantWhen no simple or well defined DNA mutation event describes the observed DNA change, the keyword "complex" should be used. Usually there are multiple equally plausible explanations for the change.
Constitutive exonExonExons that are not spliced out, therefore present in all transcripts of a given gene.
ContigGenome assemblyA contig is a contiguous stretch of DNA sequence without gaps that has been assembled solely based on direct sequencing information.
Coordinate systemGenome assemblyWhich level of the assembly we are working on.
COSMICVariation source databaseDatabase of somatic variants found in cancer. COSMIC licensing does not permit redistribution of the full dataset, but mutation identifiers, locations and tumour types are available in Ensembl. http://cancer.sanger.ac.uk/cosmic
CosmidCloneDNA from a bacterial virus spliced with a small fragment of a genome (up to 50 kb) to be amplified and sequenced.
CoverageGenome assemblyRefers to the number of overlapping sequences used to build a region of the assembly. High coverage indicates a good amount of sequence information while low coverage reflects a low amount of sequence information.
CTCF binding sites Regulatory featuresRegions that bind CTCF, the insulator protein that demarcates open and closed chromatin.
Cytogenetic bandGenome assemblyA banding pattern on a chromosome resulting from staining and examination by microscopy. These are named in terms of the chromosome arm they are found on, and are often used as a shorthand for describing the location of genomic features.
D'Linkage disequilibriumThe difference between the observed and the expected frequency of a given haplotype. If two loci are independent (i.e. in linkage equilibrium and therefore not coinherited at all), the D' value will be 0.
dbSNPVariation source databaseThe Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple (short) genetic polymorphisms in human, maintained by NCBI. https://www.ncbi.nlm.nih.gov/projects/SNP/
dbVarVariation source databasedbVar is NCBI's database of human genomic structural variation — insertions, deletions, duplications, inversions, mobile elements, and translocations. https://www.ncbi.nlm.nih.gov/dbvar/
DDBJINSDCThe Asian branch of INSDC. http://www.ddbj.nig.ac.jp/
DeletionSequence variantDeletion of one or more nucleotides
DGVaVariation source databaseThe Database of Genomic Variants archive (DGVa) is a repository that provides archiving, accessioning and distribution of publicly available genomic structural variants, in all species. https://www.ebi.ac.uk/dgva
DNA methylationEpigenome evidenceModification of cytosines in CpGs with methyl groups, which is known to repress gene expression.
DNase sensitivityEpigenome evidenceA method to determine regions of open and closed chromatin.
Downstream gene variantVariant consequenceA sequence variant located 3' of a gene
DUSTAlgorithmA standalone application that looks for low complexity sequences.
EMBL (file format)File formatsEMBL files store sequence and accompanying annotation for features across a genomic region. They can be exported from various webpages in Ensembl and are stored for 1Mb regions across the genome.
EMF Alignment formatFile formatsEnsembl Multi Format (EMF) stores genomic alignments in Ensembl.
ENAINSDCEurope's primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are submissions from individual researchers, genome sequencing projects and patent applications. https://www.ebi.ac.uk/ena
ENCODEEpigenome source databaseProject aiming to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active, by large scale functional analyses of laboratory cell lines. Used as a source for the Ensembl regulatory build. https://www.encodeproject.org/
Enhancers Regulatory featuresRegions that bind transcription factors and interact with promoters to stimulate transcription of distant genes.
Ensembl canonicalTranscriptA single transcript chosen for a gene which is the most conserved, most highly expressed, has the longest coding sequence and is represented in other key resources, such as NCBI and UniProt. This is defined in detail on http://www.ensembl.org/info/genome/genebuild/canonical.html
Ensembl default (VEP)File formatsEnsembl default is an input format for the VEP, used to describe the position and alleles of a variant.
Ensembl gene tree pipelineAlgorithmThe process by which Ensembl compare gene sequences in order to construct gene trees and predict homologues.
Ensembl GenebuildAlgorithmThe automatic process by which Ensembl plot known RNA and protein sequence onto the genome, using sequence similarity.
Ensembl HavanaEnsembl GenebuildHuman And Vertebrate ANalysis and Annotation. The team within Ensembl who manually annotate genes and transcripts for a subset of species.
Ensembl Regulatory BuildAlgorithmThe process by which Ensembl predict the location of regions that regulate gene expression using epigenomic evidence.
Ensembl sourcesPublicly available database that Ensembl imports data from.
EpigenomeRegulatory activityA cell type, such as a primary tissue or lab cell line, for which we have epigenome evidence and can predict regulatory features.
Epigenome evidenceRegulatory featuresExperimental data that is used to construct and determine activity of regulatory features.
Epigenome source databaseEnsembl sourcesDatabase from which Ensembl imports ChIP-seq, DNase-seq and other related datasets, which are used in the Ensembl regulatory build.
EPOMultiple whole genome alignmentThe EPO (Enredo, Pecan, Ortheus) pipeline is a three-step pipeline for whole-genome multiple alignments: it takes Enredo segments, aligns them with Pecan and constructs ancestral sequences with Ortheus.
EponineAlgorithmEponine is a probabilistic method for detecting transcription start sites (TSS) in mammalian genomic sequence, with good specificity and excellent positional accuracy. Eponine models consist of a set of DNA weight matrices recognising specific sequence motifs. Each of these is associated with a position distribution relative to the TSS. http://www.sanger.ac.uk/science/tools/eponine
eQTLQTLGenetic loci where allelic variation is associated with expression levels of other genes.
ESTGenome annotationExpressed Sequence Tag. Coarse sequence reads from flanking vector regions into the inserts of cDNA libraries. ESTs act as physical markers for cloning and full length sequencing of the cDNAs of expressed genes. Typically identified by purifying mRNAs, converting to cDNAs, and then sequencing a portion of the cDNAs. Usually short, single reads from a tissue or stage in development.
EVAVariation source databaseThe European Variation Archive is an open-access database of all types of genetic variation data from all species. https://www.ebi.ac.uk/eva/
Evidence statusVariantCodes that reflect the amount and type of evidence that supports the existence of a variant.
ExonTranscriptTranscribed genomic region that remains in the RNA after splicing, includes both the CDS and the UTRs.
ExonerateAlgorithmA fast gapped DNA-DNA alignment algorithm. It can be used for aligning various types of sequences such as genomic DNA, cDNAs/ESTs, and proteins. It is used in the Targetted stage of the Ensembl GeneBuild. https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate
External referenceGenome annotationMapping between Ensembl genes, transcripts and proteins to the same features in other databases.
FASTAFile formatsFASTA is used to store finished nucleotide and peptide sequences. The Ensembl FTP site has genome, cDNA, CDS and peptide sequences in FASTA, and you can export FASTA from various webpages in Ensembl.
Feature elongationVariant consequenceA sequence variant that causes the extension of a genomic feature, with regard to the reference sequence
Feature truncationVariant consequenceA sequence variant that causes the reduction of a genomic feature, with regard to the reference sequence
Fix patchPatchFix patches are where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence.
Flagged variantVariantVariants that failed our quality control analyses, therefore they are flagged as suspicious.
Flanking sequenceGenome annotationSequence 5' or 3' to a DNA or RNA sequence of interest (for example gene, transcript, SNP or repeat).
Forward strandGenome annotationDNA strand arbitrarily defined as the strand with its 5' end at the tip of the short chromosome arm (p). If a gene is forward-stranded, its sense (sequence matching cDNA) is on the forward strand. The forward strand is the reverse complement of the reverse strand.
Frameshift variantVariant consequenceA sequence variant which causes a disruption of the translational reading frame, because the number of nucleotides inserted or deleted is not a multiple of three
GenBank (database)INSDCThe US branch of INSDC. https://www.ncbi.nlm.nih.gov/genbank/
GenBank (file format)File formatsGenBank files store sequence and accompanying annotation for features across a genomic region. They can be exported from various webpages in Ensembl and are stored for 1Mb regions across the genome.
GENCODEGene source databaseThe aim of GENCODE as a sub-project of the ENCODE scale-up project is to annotate all evidence-based gene features in the entire human and mouse genomes at a high accuracy. The GENCODE gene set is the default geneset in Ensembl and is equivalent to the Ensembl/HAVANA merged genes. https://www.gencodegenes.org/
GENCODE BasicTranscriptA subset of the GENCODE transcript set, containing only 5' and 3' complete transcripts.
GENCODE ComprehensiveTranscriptThe full GENCODE transcript set, containing both complete transcripts and 5' and 3' incomplete transcripts.
GeneGenome annotationGenomic locus where transcription occurs. A gene may have one or more transcripts, which may or may not encode proteins.
Gene OntologyGene source databaseAn organised hierarchy of terms produced by the Gene Ontology Consortium, used to describe the function of proteins. GO terms are split into three subcategories: biological processes (what the protein does), cellular component (where in the cell the protein is found), and molecular function (how the protein acts). http://www.geneontology.org/
Gene source databaseEnsembl sourcesDatabase from which Ensembl imports cDNA or protein sequence for gene annotation, or gene names.
Gene splitParaloguesPairs of genes in a species that occur together in the same tree, but are actually two halves of the same gene split partway along.
Gene treeHomologuesA representation of the evolutionary relationship between homologues, constructed using the Ensembl gene tree pipeline.
Genetic markerVariantA measurable locus that varies within a population.
GeneWiseAlgorithmGeneWise is a sequence analysis tool for comparing proteins to DNA sequences allowing for introns and frameshifts. It is used in the Targetted stage of the Ensembl GeneBuild. https://www.ebi.ac.uk/Tools/psa/genewise/
GenomeThe complete set of DNA found in each cell.
Genome annotationA genomic locus that has been annotated.
Genome assemblyGenomeA computational representation of the sequence of a haploid genome, representative of a species or strain.
GenotypeAllele (variant)The specific alleles that are present in an individual's genome. In diploid organisms two alleles make up the genotype (except for the sex chromosomes).
GENSCANAlgorithmAn HMM-based ab initio gene prediction method, used to create a track of ab initio genes in Ensembl. http://genes.mit.edu/GENSCAN.html
GFFFile formatsGFF is a tab-delimited format that describes genomic features, such as genes and transcripts, and allows hierarchical linking of gene features. Ensembl store gene files as GFF, allow attachment of GFF files to view against the genome and allow custom annotation with the VEP using GFF files.
Global MAFMinor allele frequencyThe frequency of the second most common allele in the global population, defined in human by the 1000 Genomes Project phase 3.
gnomADVariation source databaseAn aggregation of publicly available whole genome and whole exome variant calling experiments in human. gnomAD was previously known as ExAC, when it contained only exome data. Ensembl display population frequencies from gnomAD. http://gnomad.broadinstitute.org/
Golden path (genome size)Genome assemblyThe golden path is the length of the non-redundant reference assembly. It excludes alternative sequences and PARs, but includes the estimated size of the gaps.
GTFFile formatsGTF is a tab-delimited format that describes genomic features, such as genes and transcripts, and allows hierarchical linking of gene features. Ensembl store gene files as GTF, allow attachment of GTF files to view against the genome and allow custom annotation with the VEP using GTF files.
GVFFile formatsGenome Variation Format (GVF) is used to store variation data. It can be found on the Ensembl FTP site.
GWAS catalogPhenotype source databaseA curated database that extracts associations between variants and genes from published genome-wide association studies in human. Ensembl display phenotypes from the GWAS catalog. https://www.ebi.ac.uk/gwas/
Haplotype (genome)Alternative sequenceKnown variations to the primary assembly, due to variability in the human genome sequence (e.g. the highly variable MHC locus). These were included as part of the genome assembly when it was first produced.
Haplotype (variation)Linkage disequilibriumA set of variant alleles in a contiguous genomic region. A haplotype block describes a set of alleles which tend to be inherited together.
HapMapVariation source databaseAn international collaboration formed to develop a haplotype map of the human genome and thus describe the common patterns of human DNA sequence variation using genotyping. Ensembl display sample genotypes and population frequencies from the HapMap project. https://www.genome.gov/10001688/international-hapmap-project/
Hard maskedRepeat maskingHard masked sequence is repeat masked with the repeat sequences replaced by Ns. Hard masked sequence files on the Ensembl FTP site have "rm" in their file name.
HGMDVariation source databaseProject aiming to collate all known (published) gene lesions responsible for human inherited disease. Full HGMD access is restricted to license holders so Ensembl supports the minimal public data release which consists of variant/mutation names and locations. http://www.hgmd.cf.ac.uk/ac/index.php
HGNCGene source databaseHGNC is responsible for approving unique symbols and names for human loci, including protein coding genes, ncRNA genes and pseudogenes, to allow unambiguous scientific communication. HGNC gene names are used for Ensembl human genes, where available, and for orthologous genes in other species. https://www.genenames.org/
HGVS nomenclatureFile formatsA set of recommendations for variant naming. The nomenclature describes the change a variant allele has on a named (genomic, transcript or protein) sequence. Can be used as an input for the VEP and displayed for known variants. http://varnomen.hgvs.org/
High impact variant consequenceVariant impactThe variant is assumed to have high (disruptive) impact on the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay.
Highest population MAFMinor allele frequencyThe highest minor allele frequency observed in any population typed for this variant. For human this includes the 1000 Genomes Project, gnomAD and UK10K.
Histone modificationEpigenome evidenceCovalent modifications to the histone proteins that make up the nucleosome, which are known to regulate gene expression.
HomoeologuesHomologuesPairs of genes in a polyploid genome that underwent (a) hybridisation event(s). The original genes were orthologues in the two (or more) species that hybridised, and now occur in the same species. Since they did not arise through a duplication event, they are not paralogues.
HomologuesGeneSpecific genes that are descended from the same common sequence in an ancestor.
IdentityAlignmentsA measure of how similar two alignment sequences are, specifically, what percentage of amino acids or nucleotides are the same in type and position between the two sequences. The value is dependent on which sequence is used as the reference, since it is a percentage of that reference.
IG C geneIG geneConstant chain immunoglobulin gene that undergoes somatic recombination before transcription
IG D geneIG geneDiversity chain immunoglobulin gene that undergoes somatic recombination before transcription
IG geneBiotypeImmunoglobulin gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/.
IG J geneIG geneJoining chain immunoglobulin gene that undergoes somatic recombination before transcription
IG pseudogenePseudogeneInactivated immunoglobulin gene.
IG V geneIG geneVariable chain immunoglobulin gene that undergoes somatic recombination before transcription
IMGTGene source databaseInternational ImMunoGeneTics information system. Database of immunoglobulin and T-cell receptor annotation. We collaborate with IMGT on manual annotation of somatically recombined genes. http://www.imgt.org/
IMPCPhenotype source databaseAn international scientific endeavour to create and characterise the phenotype of 20,000 knockout mouse strains. Ensembl display phenotypes from the IMPC. http://www.mousephenotype.org/
InactiveRegulatory activityWhen a regulatory feature bears no epigenetic modifications from the ones included in the Regulatory Build.
Incomplete terminal codon variantVariant consequenceA sequence variant where at least one base of the final codon of an incompletely annotated transcript is changed
IndelSequence variantAn insertion and a deletion, affecting two or more nucleotides
Inframe deletionVariant consequenceAn inframe non synonymous variant that deletes bases from the coding sequence
Inframe insertionVariant consequenceAn inframe non synonymous variant that inserts bases into the coding sequence
INSDCGene source databaseAn international consortium between the ENA, GenBank and DDBJ to share submissions of nucleotide sequence. These sequences are used as evidence for annotating Ensembl genes. http://www.insdc.org/
InsertionSequence variantInsertion of one or more nucleotides
Interchromosomal breakpointTranslocationA rearrangement breakpoint between two different chromosomes.
Interchromosomal translocationTranslocationA translocation where the regions involved are from different chromosomes.
Intergenic variantVariant consequenceA sequence variant located in the intergenic region, between genes
InterProScanAlgorithmInterPro is an integrated resource for protein families, domains and sites, combining information from several different protein signature databases, including PROSITE, PRINTS, Pfam, Seg, SignalP, Gene3D, SMART, TIGRFAMs, PIR SuperFamilies and SUPERFAMILY. Ensembl run InterProScan on all protein sequences, which uses these protein signatures to identify domains. https://www.ebi.ac.uk/interpro/
Intrachromosomal breakpointTranslocationA rearrangement breakpoint within the same chromosome.
Intrachromosomal translocationTranslocationA translocation where the regions involved are from the same chromosome.
IntronTranscriptTranscribed genomic region that is removed from the RNA by splicing.
Intron variantVariant consequenceA transcript variant occurring within an intron
InversionStructural variantA continuous nucleotide sequence is inverted in the same position
KaryotypeGenome assemblyThe number of chromosomes of a genome.
LastZPairwise whole genome alignmentLASTZ is a program for aligning DNA sequences in a pairwise manner. Its predecessor is BlastZ.
lincRNA (long intergenic ncRNA)Long non-coding RNA (lncRNA)Transcripts from long intergenic non-coding RNA loci with a length >200bp. Requires lack of coding potential and may not be conserved between species.
Linkage disequilibriumVariantA measure of how often two variants or specific sequences are inherited together.
Long non-coding RNA (lncRNA)Processed transcriptA non-coding gene/transcript >200bp in length
Loss of heterozygosityStructural variantA functional variant whereby the sequence alteration causes a loss of function of one allele of a gene.
Low complexity regionsRepeatPoly-purine or poly-pyrimidine stretches, or regions of extremely high AT or GC content.
Low impact variant consequenceVariant impactA variant that is assumed to be mostly harmless or unlikely to change protein behaviour.
LTRsRepeatLong terminal repeats.
Macro lncRNALong non-coding RNA (lncRNA)Unspliced lncRNAs that are several kb in size.
MAFFile formatsMultiple alignment format (MAF) stores genomic alignments.
Major alleleAllele (variant)The allele which is most frequent in the global population, defined in human by the 1000 Genomes Project. The major allele may be the reference or the alternative allele, and may or may not be the ancestral allele.
MANETranscriptThe Matched Annotation from NCBI and EMBL-EBI is a collaboration between Ensembl/GENCODE and RefSeq to identify transcripts that match GRCh38 and are 100% identical between RefSeq and Ensembl/GENCODE for 5' UTR, CDS, splicing and 3'UTR.
MANE Plus ClinicalMANETranscripts in the MANE Plus Clinical set are additional transcripts per locus necessary to support clinical variant reporting, for example transcripts containing known Pathogenic or Likely Pathogenic clinical variants not reportable using the MANE Select set. Note there may be additional clinically relevant transcripts in the wider RefSeq and Ensembl/GENCODE sets but not yet in MANE.
MANE SelectMANEThe Matched Annotation from NCBI and EMBL-EBI is a collaboration between Ensembl/GENCODE and RefSeq. The MANE Select is a default transcript per human gene that is representative of biology, well-supported, expressed and highly-conserved. This transcript set matches GRCh38 and is 100% identical between RefSeq and Ensembl/GENCODE for 5' UTR, CDS, splicing and 3'UTR.
Many-to-many orthologuesOrthologuesA type of orthologue assigned for a pair of species where multiple orthologues are found in both species, where the duplication events in both species occurred after the speciation event.
MarkerGenome annotationA short sequence whose placement on the genome is known.
Mature miRNA variantVariant consequenceA transcript variant located within the sequence of the mature miRNA
MetaLRAlgorithmA tool for predicting the pathogenicity of single nucleotide variants using a logistic regression based ensemble method.
MGIGene source databaseMGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease. MGI gene names are used for Ensembl mouse genes, where available. http://www.informatics.jax.org/
MicrosatelliteRepeatA region in the genomic sequence containing short tandem repeats of 2-10bp.
Minor alleleAllele (variant)The allele which is the second most frequent in the global population, defined in human by the 1000 Genomes Project. The minor allele may be the reference or the alternative allele, and may or may not be the ancestral allele.
Minor allele frequencyMinor alleleThe frequency of the second most common allele in the specified population.
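As a minimal sketch of the minor allele frequency definition above, the function below ranks alleles by count within one population and reports the second most common. The function name and the input mapping are hypothetical illustrations, not an Ensembl data structure or API.

```python
def minor_allele_frequency(allele_counts):
    """Return (allele, frequency) for the second most common allele.

    allele_counts: mapping of allele string -> observed count in one
    population (hypothetical input, assumed for illustration).
    Returns None when fewer than two alleles were observed.
    """
    total = sum(allele_counts.values())
    if total == 0 or len(allele_counts) < 2:
        return None  # no minor allele can be defined
    # Sort by count (descending), then by allele so ties are deterministic.
    ranked = sorted(allele_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    allele, count = ranked[1]
    return allele, count / total
```

For a multi-allelic site, this follows the glossary wording (second most frequent allele), not the rarest allele.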
miRbaseGene source databaseThe miRBase database is a searchable database of published miRNA sequences and annotation. These sequences are used as evidence for annotating Ensembl miRNA genes. http://www.mirbase.org/
miRNAncRNAA small RNA (~22bp) that silences the expression of target mRNA.
miscRNAncRNAMiscellaneous RNA. A non-coding RNA that cannot be classified.
Missense variantVariant consequenceA sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved
Mobile element deletionStructural variantA deletion of a mobile element when comparing a reference sequence (has mobile element) to an individual sequence (does not have mobile element).
Mobile element insertionStructural variantA kind of insertion where the inserted sequence is a mobile element.
Moderate impact variant consequenceVariant impactA non-disruptive variant that might change protein effectiveness.
Modifier impact variant consequenceVariant impactUsually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact.
Multiple whole genome alignmentWhole genome alignmentAn alignment between more than two whole genomes of a selected taxon.
MutationAssessorAlgorithmA tool for assessing the functional impact of single nucleotide variants based on evolutionary conservation of the affected amino acid in protein homologues.
MySQLFile formatsMySQL is a database. All Ensembl data is stored in MySQL relational tables, which can be found on the FTP site and accessed directly by MySQL queries.
NARegulatory activityWhen there is no available data in the cell type for this regulatory feature.
ncRNAProcessed transcriptA non-coding gene.
NewickFile formatsNewick is a tree format. Ensembl gene trees can be downloaded in Newick and it is used to store Ensembl species trees.
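To illustrate the Newick format, the sketch below pulls leaf names out of a small tree string with a regular expression. The example tree and the function name are assumptions made for illustration. The regex handles only simple unquoted labels, so a dedicated tree library is a better choice for real gene trees.

```python
import re

def newick_leaf_names(newick):
    """Return leaf labels from a simple Newick string.

    A leaf label directly follows "(" or ",". Internal node labels follow
    ")" and are therefore skipped. Quoted labels are not supported.
    """
    return [name.strip() for name in re.findall(r"[(,]\s*([^(),:;]+)", newick)]

# Hypothetical three-species tree with branch lengths.
example_tree = "((human:0.1,chimp:0.1):0.2,mouse:0.5);"
leaves = newick_leaf_names(example_tree)
```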
NMD transcript variantVariant consequenceA variant in a transcript that is the target of NMD
Non codingLong non-coding RNA (lncRNA)Transcripts which are known from the literature to not be protein coding.
Non coding transcript exon variantVariant consequenceA sequence variant that changes non-coding exon sequence in a non-coding transcript
Non coding transcript variantVariant consequenceA transcript variant of a non coding RNA gene
Non-ATG startTranscriptA transcript with a non-ATG start codon but which still encodes a methionine since the ribosomal machinery allows non-AUG to translate as methionine in specific cases.
Nonsense Mediated DecayBiotypeA transcript with a premature stop codon considered likely to be subjected to targeted degradation. Nonsense-Mediated Decay is predicted to be triggered where the in-frame termination codon is found more than 50bp upstream of the final splice junction.
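The 50bp rule described above can be sketched as a simple check in spliced transcript coordinates. This is an illustration under stated assumptions, not the annotation pipeline's actual logic: the function name, the input layout, and the choice to measure from the last base of the stop codon are all hypothetical.

```python
def predicts_nmd(exon_lengths, stop_codon_end, threshold=50):
    """Apply the 50bp rule to one transcript.

    exon_lengths: exon lengths in transcript order (5' to 3').
    stop_codon_end: 1-based spliced-transcript coordinate of the last
        base of the in-frame stop codon (assumed convention).
    threshold: distance in bp that must be exceeded (50 in the rule above).
    """
    if len(exon_lengths) < 2:
        return False  # single-exon transcript: no splice junction exists
    # Coordinate of the last base before the final splice junction.
    final_junction = sum(exon_lengths[:-1])
    return final_junction - stop_codon_end > threshold
```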
Novel patchPatchNovel patches represent new allelic loci. They can usually be considered as similar to haplotypes and are likely to be reclassified as such in the next genome assembly, but not necessarily.
Novel sequence insertionStructural variantAn insertion the sequence of which cannot be mapped to the reference genome.
OMIAPhenotype source databaseAn online database that describes the function and phenotypes associated with animal genes. Ensembl display phenotypes from OMIA. https://www.omia.org/
OMIMPhenotype source databaseAn online database that describes the function and phenotypes associated with human genes. Ensembl display phenotypes from OMIM and MIM morbid. https://www.omim.org/
Open chromatin regions Regulatory featuresRegions of spaced out histones, making them accessible to protein interactions.
OrphanetPhenotype source databaseA catalogue of rare disease associations. Ensembl display phenotypes from Orphanet. http://www.orpha.net/
OrthologuesHomologuesOrthologues are genes derived from a common ancestor through vertical descent (or speciation) and can be thought of as the direct evolutionary counterpart.
OrthoXMLFile formatsOrthoXML is an XML format to allow the storage and comparison of orthology data. It is used to store Ensembl homologues.
Other paraloguesParaloguesParalogues which are very far away from the other members of a paralogue family. They are part of the same super-family, but the precise taxonomic relationship to other members is undefined, as the trees are too large to compute.
Pairwise interactions (WashU)File formatsPairwise interactions, such as those derived from Hi-C, can be stored in the WashU format and viewed in Ensembl.
Pairwise whole genome alignmentWhole genome alignmentAn alignment between two whole genomes.
PARGenome assemblySmall regions of sequence identity located at the tips of the short and the long arms of the X and Y chromosomes where recombination and genetic exchange take place. Genes within the pseudoautosomal region are not sex linked.
ParaloguesHomologuesGenes (homologues) that have evolved by duplication.
PatchAlternative sequenceNew sequences that have been added to the genome assembly since its release. There are two types: fix and novel patches.
PDBProtein source databaseA repository for 3D biological macromolecular structure data. Ensembl provide links out to the PDB, and use structures to display the locations of variants in proteins. http://www.ebi.ac.uk/pdbe/
PeakEpigenome evidenceLocus identified from epigenome signal as having high signal, shown as a BigBed across the genome.
PecanMultiple whole genome alignmentPecan is a global multiple sequence alignment program that makes practical the probabilistic consistency methodology for significant numbers of sequences of practically arbitrary length.
PeptideTranscriptA sequence of amino acids, translated from a CDS.
PhaseExonThe position of an exon/intron boundary within a codon. A phase of zero means the boundary falls between codons, one means between the first and second base and two means between the second and third base. Exons have a start and end phase, whereas introns have just one phase. A boundary in a non-coding region has a phase of -1.
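The phase definition above amounts to the number of coding bases before the boundary, modulo 3. A minimal sketch follows, assuming the coding bases are counted from the first base of the CDS. Both function names are hypothetical.

```python
def boundary_phase(coding_bases_before_boundary):
    """Phase of an exon/intron boundary.

    coding_bases_before_boundary: number of CDS bases upstream of the
    boundary, or None when the boundary lies in a non-coding region.
    """
    if coding_bases_before_boundary is None:
        return -1  # non-coding boundary
    return coding_bases_before_boundary % 3

def exon_end_phases(coding_lengths):
    """End phase of each exon, assuming every exon is entirely coding."""
    phases = []
    total = 0
    for length in coding_lengths:
        total += length
        phases.append(boundary_phase(total))
    return phases
```

Because an intron does not shift the reading frame, the end phase of one exon equals the start phase of the next, which is why an intron has just one phase.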
Phenotype source databaseEnsembl sourcesDatabase from which Ensembl imports phenotype associations with genes and/or variants.
PhyloXMLFile formatsPhyloXML is an XML language for the analysis, exchange, and storage of phylogenetic trees (or networks) and associated data. It is used to store Ensembl phylogenetic trees.
piRNAncRNAAn RNA that interacts with piwi proteins involved in genetic silencing.
Placed scaffoldScaffoldA scaffold that can be positioned on a chromosome based on genetic mapping information.
PoisedRegulatory activityWhen a regulatory feature displays an epigenetic signature with the potential to be activated. It is analogous to a sprinter in the blocks.
Polymorphic pseudogenePseudogenePseudogene owing to a SNP/indel but in other individuals/haplotypes/strains the gene is translated.
PolyPhenAlgorithmA tool which predicts if missense variants are likely to affect protein function based on physical and comparative considerations. http://genetics.bwh.harvard.edu/pph2/
Primary assemblyGenome assemblyThe underlying genome sequence, without alternative sequence included.
Private alleleAllele (variant)An allele which has only been identified in one individual or one family. A private allele may be the reference or the alternative allele, and may or may not be the ancestral allele.
ProbeStructural variantA DNA sequence used experimentally to detect the presence or absence of a complementary nucleic acid.
Processed pseudogenePseudogenePseudogene that lacks introns and is thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome.
Processed transcriptBiotypeGene/transcript that doesn't contain an open reading frame (ORF).
Progressive cactusMultiple whole genome alignmentProgressive-Cactus is a next-generation aligner that stores whole-genome alignments in a graph structure.
Projection buildEnsembl GenebuildA gene build method used by Ensembl for low coverage genomes, allowing genes to be annotated that span two scaffolds by mapping to the human gene.
Promoter flanking regions Regulatory featuresTranscription factor binding regions that flank promoters.
Promoters Regulatory featuresRegions at the 5' end of genes where transcription factors and RNA polymerase bind to initiate transcription.
Protein altering variantVariant consequenceA sequence_variant which is predicted to change the protein encoded in the coding sequence
Protein codingBiotypeGene/transcript that contains an open reading frame (ORF).
Protein coding CDS not definedBiotypeAlternatively spliced transcript of a protein coding gene for which we cannot define a CDS.
Protein coding LOFBiotypeNot translated in the reference genome owing to a SNP/DIP but in other individuals/haplotypes/strains the transcript is translated. Replaces the polymorphic_pseudogene transcript biotype.
Protein domainPeptideA region of special biological interest within a single protein sequence. However, a domain may also be defined as a region within the three-dimensional structure of a protein that may encompass regions of several distinct protein sequences that accomplishes a specific function. A domain class is a group of domains that share a common set of well-defined properties or characteristics.
PseudogeneBiotypeA gene that has homology to known protein-coding genes but contains a frameshift and/or stop codon(s) which disrupt the ORF. Thought to have arisen through duplication followed by loss of function.
PSLFile formatsPSL represents alignments and can be viewed in Ensembl.
QTLVariantGenetic loci where allelic variation is associated with variation in a quantitative trait (e.g. blood pressure).
r2Linkage disequilibriumThe correlation between a pair of loci. It varies from 0 (loci are in complete linkage equilibrium) to 1 (loci are in complete linkage disequilibrium and coinherited).
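For two biallelic loci, r2 is commonly computed from haplotype and allele frequencies as D squared divided by pA(1-pA)pB(1-pB), where D = pAB - pA*pB. The sketch below implements that standard formula. The function and parameter names are illustrative assumptions, not an Ensembl API.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """Squared correlation (r2) between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and
          allele B at locus 2.
    p_a:  frequency of allele A at locus 1.
    p_b:  frequency of allele B at locus 2.
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r2 is undefined when either locus is monomorphic")
    return d * d / denom
```

With p_a = p_b = 0.5, a haplotype frequency of 0.25 gives complete equilibrium and 0.5 gives complete disequilibrium.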
RDFFile formatsResource Description Framework (RDF) is used as a metadata data model. Ensembl use it to describe links from Ensembl annotations to those annotations in other databases.
ReadthroughBiotypeA readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs).
Reference alleleAllele (variant)The allele of a variant found in the reference genome currently being studied. The reference allele is not necessarily the major or ancestral allele.
RefSeqGene source databaseNCBI's Reference Sequences (RefSeq) database is a curated database of Genbank's genomes, mRNAs and proteins. RefSeq attempts to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, tRNA, and protein products. https://www.ncbi.nlm.nih.gov/refseq/
RefSeq MatchMANE SelectRefSeq transcripts that match 100% across the sequence, exon/intron structure and UTRs as part of the MANE project
Regulatory activityRegulatory featuresThe activity state of a regulatory feature in a specific epigenome.
Regulatory featuresGenome annotationRegions that are predicted to regulate the expression of genes, based on the Ensembl regulatory build.
Regulatory region ablationVariant consequenceA feature ablation whereby the deleted region includes a regulatory region
Regulatory region amplificationVariant consequenceA feature amplification of a region containing a regulatory region
Regulatory region variantVariant consequenceA sequence variant located within a regulatory region
Repeat maskingRepeatThe method by which repeated sequences and low-complexity regions are hidden, usually used in searches by alignment and homology-searching programs.
RepeatMaskerAlgorithmThe method by which repeated sequences and low-complexity regions are hidden, usually used in searches by alignment and homology-searching programs. http://www.repeatmasker.org/
RepressedRegulatory activityWhen a regulatory feature is epigenetically repressed, having an epigenetic signature that prevents it from being active.
Retained intronLong non-coding RNA (lncRNA)An alternatively spliced transcript believed to contain intronic sequence relative to other, coding, transcripts of the same gene.
REVELAlgorithmA tool for predicting the pathogenicity of single nucleotide variants using an ensemble method.
Reverse strandGenome annotationDNA strand arbitrarily defined as the strand with its 5' end at the tip of the long chromosome arm (q). If a gene is reverse-stranded, its sense (sequence matching cDNA) is on the reverse strand. Reverse strand is reverse complementary to the forward strand.
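Since the reverse strand is the reverse complement of the forward strand, the sequence of a reverse-stranded feature can be derived from forward-strand sequence as in this minimal sketch. The function name is hypothetical, and only A, C, G, T and N in either case are translated.

```python
# Translation table for complementing bases; case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```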
RfamGene source databaseThe Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs). These sequences are used as evidence for annotating Ensembl non-coding genes. http://rfam.xfam.org/
RNA repeatsRepeatNon-functional copies of RNA genes which have been reintegrated into the genome with the assistance of a reverse transcriptase.
Roadmap EpigenomicsEpigenome source databaseProject aiming to develop publicly available reference epigenome maps from a variety of cell types. http://www.roadmapepigenomics.org/
rRNAncRNAThe RNA component of a ribosome.
Satellite repeatsRepeatMultiple copies of the same base sequence on a DNA sequence. The repeated pattern can vary in length from a single base to several thousand bases long.
ScaffoldGenome assemblyScaffolds are sets of ordered, oriented contigs, assembled by sequence overlap. They are longer sequences than contigs, but shorter than full chromosomes.
Sense intronicLong non-coding RNA (lncRNA)A long non-coding transcript in introns of a coding gene that does not overlap any exons.
Sense overlappingLong non-coding RNA (lncRNA)A long non-coding transcript that contains a coding gene in its intron on the same strand.
Sequence variantVariantVariant that only affects a small locus
SGDGene source databaseCanonical database for the molecular biology and genetics of Saccharomyces cerevisiae, source of the annotation seen in Ensembl. https://www.yeastgenome.org/
Short tandem repeat variantVariantA variation that expands or contracts a tandem repeat with regard to a reference.
SIFTAlgorithmA tool which predicts if missense variants are likely to affect protein function based on sequence homology and the physico-chemical similarity between the alternate amino acids. http://sift.bii.a-star.edu.sg/
SignalEpigenome evidenceA count of the number of NGS reads from an epigenome experiment aligned to a locus, shown as a BigWig across the genome.
SimilarityAlignmentsHow well one sequence matches another, as calculated by an alignment program from identical and conserved residues/nucleotides.
Simple repeatsRepeatDuplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc.
siRNAncRNAA small RNA (20-25bp) that silences the expression of target mRNA through the RNAi pathway.
SliceGenome assemblyThe term "slice" in Ensembl refers to a length of DNA sequence. A slice can be any length, from one base long to the entire length of a chromosome.
snoRNAncRNASmall RNA molecules that are found in the cell nucleolus and are involved in the post-transcriptional modification of other RNAs.
SNPSequence variantSingle Nucleotide Polymorphism, substitution of a single nucleotide for another nucleotide
snRNAncRNASmall RNA molecules that are found in the cell nucleus and are involved in the processing of pre messenger RNAs
Soft maskedRepeat maskingSoft masked sequence is repeat masked with the repeat sequences in lower case. Soft masked sequence files on the Ensembl FTP site have "sm" in their file name.
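Because soft masking encodes repeats purely as lower case, it is easy to work with in code. The sketch below, with hypothetical function names, converts a soft-masked sequence to a hard-masked one (repeats replaced by N, a common convention assumed here) and reports the masked fraction.

```python
def hard_mask(soft_masked_seq):
    """Replace every lower-case (repeat-masked) base with N."""
    return "".join("N" if base.islower() else base for base in soft_masked_seq)

def masked_fraction(soft_masked_seq):
    """Fraction of bases that are lower case, i.e. flagged as repeat."""
    if not soft_masked_seq:
        return 0.0
    masked = sum(base.islower() for base in soft_masked_seq)
    return masked / len(soft_masked_seq)
```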
Splice acceptor variantVariant consequenceA splice variant that changes the 2 base region at the 3' end of an intron
Splice donor variantVariant consequenceA splice variant that changes the 2 base region at the 5' end of an intron
Splice region variantVariant consequenceA sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron
Start lostVariant consequenceA codon variant that changes at least one base of the canonical start codon
Stop codon readthroughBiotypeThe coding sequence contains a stop codon that is translated (as supported by experimental evidence), and termination occurs instead at a canonical stop codon further downstream. It is currently unknown which codon is used to replace the translated stop codon, hence it is represented by 'X' in the protein sequence
Stop gainedVariant consequenceA sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened transcript
Stop lostVariant consequenceA sequence variant where at least one base of the terminator codon (stop) is changed, resulting in an elongated transcript
Stop retained variantVariant consequenceA sequence variant where at least one base in the terminator codon is changed, but the terminator remains
Structural variantVariantVariant that affects a large locus
SubstitutionSequence variantA sequence alteration where the length of the deleted sequence is the same as the length of the inserted sequence.
SwissProtUniProtUniProt/Swiss-Prot is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases. These sequences are used as evidence for annotating Ensembl genes.
Synonymous variantVariant consequenceA sequence variant where there is no resulting change to the encoded amino acid
SyntenyWhole genome alignmentIn a genomic context we refer to syntenic regions if the sequence is globally conserved between two species.
TAGENETranscriptLong-read sequence data is computationally processed into non-redundant transcript models which are manually appraised by the Ensembl-Havana annotation team.
Tandem duplicationStructural variantA duplication consisting of 2 identical adjacent regions.
Tandem repeatVariantTwo or more adjacent copies of a region (of length greater than 1).
Tandem repeatsRepeatTypically found at the centromeres and telomeres of chromosomes these are duplications of more complex 100-200 base sequences.
TEC (To be Experimentally Confirmed)BiotypeRegions with EST clusters that have polyA features that could indicate the presence of protein coding genes. These require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies.
TF binding site variantVariant consequenceA sequence variant located within a transcription factor binding site
TFBS ablationVariant consequenceA feature ablation whereby the deleted region includes a transcription factor binding site
TFBS amplificationVariant consequenceA feature amplification of a region containing a transcription factor binding site
ToplevelGenome assemblyThe largest continuous sequence for an organism. The official technical definition for toplevel sequences is 'sequence regions in the genome assembly that are not a component of another sequence region'. For example, when a genome is assembled into chromosomes, toplevel sequences will be chromosomes and unplaced scaffolds. If a genome has only been assembled into scaffolds, then toplevel sequences are scaffolds and unplaced contigs.
TOPMedVariation source databaseWhole genome variant calling data from humans worldwide with heart, lung, blood, and sleep disorders. Ensembl display population frequencies from TOPMed. https://www.nhlbi.nih.gov/science/trans-omics-precision-medicine-topmed-program
TR C geneTR geneConstant chain T cell receptor gene that undergoes somatic recombination before transcription
TR D geneTR geneDiversity chain T cell receptor gene that undergoes somatic recombination before transcription
TR geneBiotypeT cell receptor gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/.
TR J geneTR geneJoining chain T cell receptor gene that undergoes somatic recombination before transcription
TR V geneTR geneVariable chain T cell receptor gene that undergoes somatic recombination before transcription
Transcribed pseudogenePseudogenePseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'.
TranscriptGeneA transcript is the operational unit of a gene. In a genomic context, transcripts consist of one or more exons, with adjoining exons being separated by introns. The exons/introns are transcribed and then the introns spliced out. Transcripts may or may not encode a protein
Transcript ablationVariant consequenceA feature ablation whereby the deleted region includes a transcript feature
Transcript amplificationVariant consequenceA feature amplification of a region containing a transcript
Transcript haplotypeHaplotype (variation)The transcript sequence derived from one copy of a gene in an individual, based on the phased 1000 Genomes genotype data. CDS and protein sequences are derived from this.
Transcript support levelTranscriptThe Transcript Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users, based on the type and quality of the alignments used to annotate the transcript.
Transcription factorEpigenome evidenceA protein that binds to DNA and controls the rate of transcription.
Transcription factor binding motifRegulatory featuresShort genomic sequence that is known to bind to a particular transcription factor.
Transcription factor binding sites Regulatory featuresSites which bind transcription factors, for which no other role can be determined as yet.
Translated BlatPairwise whole genome alignmentTranslated Blat can be used for alignment of the coding regions of genomes only in a pairwise manner.
Translated pseudogenePseudogenePseudogenes that have mass spec data suggesting that they are also translated. These can be classified into 'Processed' and 'Unprocessed'.
TranslocationStructural variantA region of nucleotide sequence that has translocated to a new position
TrEMBLUniProtA subset of TrEMBL (Translated EMBL database) containing the computer-annotated protein translations of all coding sequences (CDS) present in the ENA (formerly EMBL-bank) that are not yet incorporated into the UniProt/SwissProt database. These sequences are used as evidence for annotating Ensembl genes.
tRNAncRNAA transfer RNA, which acts as an adaptor molecule for translation of mRNA.
TSL 1Transcript support levelA transcript where all splice junctions are supported by at least one non-suspect mRNA.
TSL 2Transcript support levelA transcript where the best supporting mRNA is flagged as suspect or the support is from multiple ESTs
TSL 3Transcript support levelA transcript where the only support is from a single EST
TSL 4Transcript support levelA transcript where the best supporting EST is flagged as suspect
TSL 5Transcript support levelA transcript where no single transcript supports the model structure.
TSL NATranscript support levelA transcript that was not analysed for TSL.
Type I Transposons/LINERepeatLong Interspersed Elements. Retrotransposed elements in the genome containing open reading frames encoding (often inactive) reverse transcription machinery.
Type I Transposons/SINERepeatShort Interspersed Elements. Retrotransposed elements less than 500 bp that contain tRNA, snRNA and rRNA, which require other mobile elements to be transposed. Alu elements are a type of SINE.
Type II TransposonsRepeatElements that have been transposed and duplicated around the genome by excision and ligation.
UCSC Genome BrowserGene source databaseA genome browser hosted at the University of California Santa Cruz. Ensembl collaborates with UCSC in projects such as GENCODE, CCDS and TSL. https://genome.ucsc.edu/
UK10KVariation source databaseStudy comparing exomes of 6000 diseased individuals with 4000 healthy individuals in the UK in order to identify disease-causing variants. Ensembl display population frequencies from the control group. https://www.uk10k.org/
UniProtGene source databaseDatabase of protein sequence and functional information, based at European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR). These sequences are used as evidence for annotating Ensembl genes. http://www.uniprot.org/
UniProt MatchUniProtThe UniProt identifier that matches to the Ensembl transcript. This may be a UniProt protein isoform and will have a number suffix, or may just refer to a UniProt entry.
UniSTSMarkerUniSTS is an NCBI resource for non-redundant Sequence Tagged Sites (STS) markers. For each marker, UniSTS displays the primer sequences, product size, and mapping information, as well as cross references to dbSNP, RHdb, GDB, MGD, etc. The marker report also lists GenBank and RefSeq records that contain the primer sequences determined by ePCR.
Unitary pseudogenePseudogeneA species specific unprocessed pseudogene without a parent gene, as it has an active orthologue in another species.
Unknown repeatRepeatRepeats that cannot be classified.
Unplaced scaffoldScaffoldA scaffold that cannot be positioned on a chromosome.
Unprocessed pseudogenePseudogenePseudogene that can contain introns since produced by gene duplication.
Untranslated regionTranscriptThe region of a coding cDNA which is not translated.
Upstream gene variantVariant consequenceA sequence variant located 5' of a gene
VariantGenome annotationLocus where the sequence differs between individuals of the same species
Variant consequenceVariantThe effect that the variant has on each feature that it overlaps. A variant will have a consequence for each feature that it overlaps.
Variant impactVariant consequenceA subjective classification of the severity of the variant consequence, based on agreement with SNPEff.
Variation source databaseEnsembl sourcesDatabase from which Ensembl imports variation data, including loci, sample genotypes, population frequencies and phenotype associations.
vaultRNAncRNAShort non coding RNA genes that form part of the vault ribonucleoprotein complex.
VCFFile formatsVCF is a standard format for listing genetic variation, which is the output for many variant callers. It can be used as an input for the Ensembl VEP and is used to store and download variation data in Ensembl.
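A VCF data line is tab-delimited, with eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). As a minimal sketch, the hypothetical helper below splits one such line into a dictionary. It ignores header lines and any genotype columns, and the example record is made up for illustration.

```python
def parse_vcf_line(line):
    """Split one tab-delimited VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    info_dict = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            # INFO entries without "=" are flags; store them as True.
            info_dict[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),        # 1-based position
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # multiple ALT alleles are comma-separated
        "qual": qual,
        "filter": filt,
        "info": info_dict,
    }

# Made-up record: chromosome 1, position 12345, A to G or T.
record = parse_vcf_line("1\t12345\t.\tA\tG,T\t.\tPASS\tAF=0.1;DB\n")
```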
VEPAlgorithmThe Variant Effect Predictor (VEP) is an Ensembl tool that predicts the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.
VEP cacheFile formatsA VEP cache contains all the gene and variant data needed to run a VEP query, and can be used to run large queries quickly on your own machine. These can be installed as part of your VEP installation, or downloaded from the FTP site.
WasabiAlignmentsAn application for displaying sequence alignments with custom colour-annotation, which is used by Ensembl displaying gene tree and family alignments. http://wasabiapp.org/
Whole genome alignmentAlignmentsAn alignment carried out using the whole genome sequence.
WiggleFile formatsWiggle format expresses scores across genomic loci, requiring fixed size bins for the scores. It can be uploaded to view in Ensembl.
Within species paraloguesParaloguesTwo or more versions of a duplicated gene in a single species. In a gene tree, the genes are separated by a duplication node.
YACCloneOriginated from a bacterial plasmid, a YAC contains a yeast centromeric region, a yeast origin of DNA replication, a cluster of unique restriction sites, a selectable marker and a telomere region at the end of each arm. YACs are capable of cloning extremely large segments of DNA (over 1 megabase long) into a host cell, where the DNA is propagated along with the other chromosomes of the yeast cell.
zFINGene source databaseAn online biological database of information about the zebrafish (Danio rerio). zFIN gene names are used for Ensembl zebrafish genes, where available. https://zfin.org/