Ensembl Glossary

Term	Category	Description
1-to-1 orthologues	Orthologues	A type of orthologue assigned for a pair of species where only one copy is found in each species.
1-to-many orthologues	Orthologues	A type of orthologue assigned for a pair of species where one gene in one species is orthologous to multiple genes in the other species, due to (a) duplication event(s) in the second species.
1000 Genomes project	Variation source database	The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the human populations studied. Ensembl display sample genotypes and population frequencies from the 1000 Genomes project. http://www.internationalgenome.org/
3 prime UTR variant	Variant consequence	A UTR variant of the 3' UTR
3' incomplete	Transcript	A protein-coding transcript which is missing the stop codon due to incomplete evidence.
3' overlapping ncRNA	Long non-coding RNA (lncRNA)	Transcripts where ditag and/or published experimental data strongly support the existence of long (>200bp) non-coding transcripts that overlap the 3'UTR of a protein-coding locus on the same strand.
3' UTR	Transcript	The region of a coding cDNA downstream of the stop codon which is not translated.
5 prime UTR variant	Variant consequence	A UTR variant of the 5' UTR
5' incomplete	Transcript	A protein-coding transcript which is missing the start codon due to incomplete evidence.
5' UTR	Transcript	The region of a coding cDNA upstream of the start codon which is not translated.
Active	Regulatory activity	When a regulatory feature displays an epigenetic signature which is consistent with it carrying out its named function, for example an active Promoter has an epigenetic signature consistent with initiating transcription, while an active CTCF binding site will bind CTCF. It is analogous to a sprinter running.
AGP	File formats	A golden path. A file provided to Ensembl that describes how the longer sequences in the genome assembly were assembled from shorter sequences. For example, an AGP file can describe how a chromosome is assembled from a collection of scaffolds or a collection of contigs. For an AGP file that describes how a scaffold is assembled from a collection of contigs, each contig will be listed on a separate line in the AGP file and the line will include information about where the contig lies within the scaffold and the orientation of the contig.
Algorithm		A sequence of computational tasks or actions that carry out a specific function.
Alignments	Genome annotation	A comparison between two or more sequences by matching identical and/or similar residues/nucleotides and assigning a score to the match.
Allele (gene)	Gene	Different versions of a gene found between the primary assembly and a patch or genome haplotype.
Allele (variant)	Variant	One of a number of alternative forms of the same genetic locus/variant.
Alternative allele	Allele (variant)	Any allele of a variant which is not in the reference genome currently being studied. The alternative allele is not necessarily the minor allele.
Alternative sequence	Genome assembly	Genomic sequence that differs from the genomic DNA on the primary assembly. These are represented as sequence on top of the primary assembly. Provided by the GRC for human and mouse.
Alu insertion	Repeat	A dispersed intermediately repetitive DNA sequence found in the human genome in about one million copies. The sequence is about 300 bp long and is found commonly in introns, 3' untranslated regions of genes, and intergenic genomic regions. The name Alu comes from the recognition site for the AluI endonuclease that cleaves it.
Ambiguity code	Variant	A single letter code that represents two or more possible nucleotides at a single base locus.
Ancestral allele	Allele (variant)	The allele which occurs at this locus in closely related species and is thought to reflect the allele present at the time of speciation. The ancestral allele may be the reference or the alternative allele, and the major or minor allele.
Animal QTLdb	Phenotype source database	Project aiming to house all publicly available QTL and association data on livestock animal species. Ensembl display phenotypes from the Animal QTLdb. https://www.animalgenome.org/cgi-bin/QTLdb/index
Antisense	Long non-coding RNA (lncRNA)	Transcripts that overlap the genomic span (i.e. exons or introns) of a protein-coding locus on the opposite strand.
APPRIS	Transcript	APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods to identify the most functionally important transcript(s) of a gene.
APPRIS ALT1	APPRIS	For genes in which the APPRIS core modules are unable to choose a clear principal isoform, ALT1 marks the candidate transcript model(s) conserved in at least three tested species.
APPRIS ALT2	APPRIS	For genes in which the APPRIS core modules are unable to choose a clear principal isoform, ALT2 marks the candidate transcript model(s) that appear to be conserved in fewer than three tested species.
APPRIS P1	APPRIS	Transcript(s) expected to code for the main functional isoform based solely on the core modules in APPRIS.
APPRIS P2	APPRIS	Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant.
APPRIS P3	APPRIS	Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants has a distinct CCDS identifier, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated.
APPRIS P4	APPRIS	Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.
APPRIS P5	APPRIS	Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.
BAC	Clone	A vector used to clone DNA fragments (100 to 300-kb insert size; average, 150 kb) from another species so that it can be replicated in bacteria. Many genomes (such as human) were sequenced by cloning segments into BACs, amplifying and sequencing the clones.
BAM/CRAM	File formats	BAM and CRAM store alignments of NGS data to the genome. Ensembl allow attachment of BAM and CRAM files to view against the genome, and store RNA-seq, ChIP-seq and DNase-seq in BAM.
Base pairs (genome size)	Genome assembly	The actual number of bases of sequence we have for a full genome assembly, including alternative sequences and PARs, excluding gaps.
BED	File formats	BED is a simple format for listing genomic loci. It can be used to upload data to view in Ensembl, as a custom file for additional VEP annotation and is used to store and download constrained elements in Ensembl.
BedGraph	File formats	BedGraph allows you to store scores for loci in BED format; the loci can be of varying size. It can be uploaded to view in Ensembl.
Between species paralogues	Paralogues	Members of the same gene family in different species that are not direct orthologues. In a gene tree, these genes are separated by a duplication node.
BigBed	File formats	BigBed is an indexed form of BED, which can be used to store larger scale data. Ensembl allow attachment of BigBed files to view against the genome and store peaks of regulatory evidence as BigBed.
BigWig	File formats	BigWig is an indexed form of wiggle and can be used to store larger scale data. Ensembl simplify NGS data, such as ChIP-seq and RNA-seq into BigWig to view in the browser. It can also be used to attach your own data to Ensembl.
Biotype	Gene	A gene or transcript classification.
Bisulfite sequencing	Epigenome evidence	A method to determine the methylation of genomic cytosines.
BLAST	Algorithm	A sequence comparison algorithm optimised for speed which is used to search sequence databases for optimal local alignments to a query.
BlastZ	Pairwise whole genome alignment	BlastZ is a program for aligning DNA sequences in a pairwise manner. It has been replaced by LASTZ.
BLAT	Algorithm	An mRNA/DNA and cross-species protein sequence analysis tool to quickly find sequences of 95% and greater similarity of length 40 bases or more.
BLOSUM 62	Algorithm	A matrix that defines scores for amino acid substitutions, reflecting the similarity of physicochemical properties, and observed substitution frequencies. The BLOSUM 62 matrix is tailored using sequences sharing no more than 62% identity (sequences that are evolutionarily closer were represented by a single sequence in the alignment to avoid bias from using related family members).
Blueprint Epigenomes	Epigenome source database	Project aiming to apply functional genomics analysis on primary cells of the haematopoietic cell lineage from healthy and diseased individuals, to produce lineage-specific epigenomes. Used as a source for the Ensembl regulatory build. http://www.blueprint-epigenome.eu/
CADD	Algorithm	A tool that integrates multiple annotations into one metric for scoring the deleteriousness of single nucleotide variants.
CCDS	Transcript	A coding sequence in the Consensus Coding Sequence Set is consistently annotated between Ensembl, MGI, HGNC and NCBI. The long term goal is to support convergence towards a standard set of gene annotations on the human genome.
cDNA	Transcript	The sequence of the spliced exons of a transcript expressed in DNA notation (T rather than U), representing the coding or sense strand. The cDNA contains the whole sequence of the RNA, including coding and untranslated sequence.
CDS	Transcript	CoDing Sequence. The region of a cDNA which is translated. In Ensembl displays, the stop codon is included as part of the CDS sequence.
Centromere	Repeat	The region of the chromosome at which the two sister chromatids are joined during mitosis and meiosis, mostly composed of satellite DNA.
chain	File formats	Chain files describe the mapping between different genome assemblies. Ensembl store these on the FTP site.
ChIP-seq	Epigenome evidence	A method to determine the genomic regions that proteins bind to.
CIGAR	Alignments	The cigar line defines the sequence of matches/mismatches and deletions (or gaps) in an alignment
Clinical significance	Variant	A classification of a variant's impact on disease, taken from ClinVar.
ClinVar	Variation source database	NCBI resource that aggregates information about genomic variation and its relationship to human health. Ensembl display clinical significance and phenotypes from ClinVar. https://www.ncbi.nlm.nih.gov/clinvar/
Clone	Genome assembly	A segment of DNA that has been inserted into a vector molecule, such as a plasmid, and then replicated to form many identical copies.
CNV	Structural variant	Copy Number Variation: increases or decreases the copy number of a given locus. Subcategorised into Loss and Gain compared to the reference.
Coding sequence variant	Variant consequence	A sequence variant that changes the coding sequence
Codon	CDS	Three base pairs in either DNA or RNA that code for an amino acid (or stop translation).
Complex structural alteration	Structural variant	A structural sequence alteration or rearrangement encompassing one or more genome fragments, with four or more breakpoints.
Complex substitution	Structural variant	When no simple or well defined DNA mutation event describes the observed DNA change, the keyword "complex" should be used. Usually there are multiple equally plausible explanations for the change.
Constitutive exon	Exon	Exons that are not spliced out, therefore present in all transcripts of a given gene.
Contig	Genome assembly	A contig is a contiguous stretch of DNA sequence without gaps that has been assembled solely based on direct sequencing information.
Coordinate system	Genome assembly	Which level of the assembly we are working on.
COSMIC	Variation source database	Database of somatic variants found in cancer. COSMIC licensing does not permit redistribution of the full dataset, but mutation identifiers, locations and tumour types are available in Ensembl. http://cancer.sanger.ac.uk/cosmic
Cosmid	Clone	DNA from a bacterial virus spliced with a small fragment of a genome (up to 50 kb) to be amplified and sequenced.
Coverage	Genome assembly	Refers to the number of overlapping sequences used to build a region of the assembly. High coverage indicates a good amount of sequence information while low coverage reflects a low amount of sequence information.
CTCF binding sites	Regulatory features	Regions that bind CTCF, the insulator protein that demarcates open and closed chromatin.
Cytogenetic band	Genome assembly	A banding pattern on a chromosome resulting from staining and examination by microscopy. These are named in terms of the chromosome arm they are found on, and are often used as a shorthand for describing the location of genomic features.
D'	Linkage disequilibrium	The difference between the observed and the expected frequency of a given haplotype. If two loci are independent (i.e. in linkage equilibrium and therefore not coinherited at all), the D' value will be 0.
dbSNP	Variation source database	The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple (short) genetic polymorphisms in human, maintained by NCBI. https://www.ncbi.nlm.nih.gov/projects/SNP/
dbVar	Variation source database	dbVar is NCBI's database of human genomic structural variation — insertions, deletions, duplications, inversions, mobile elements, and translocations. https://www.ncbi.nlm.nih.gov/dbvar/
DDBJ	INSDC	The Asian branch of INSDC. http://www.ddbj.nig.ac.jp/
Deletion	Sequence variant	Deletion of one or more nucleotides
DGVa	Variation source database	The Database of Genomic Variants archive (DGVa) is a repository that provides archiving, accessioning and distribution of publicly available genomic structural variants, in all species. https://www.ebi.ac.uk/dgva
DNA methylation	Epigenome evidence	Modification of cytosines in CpGs with methyl groups, which is known to repress gene expression.
DNase sensitivity	Epigenome evidence	A method to determine regions of open and closed chromatin.
Downstream gene variant	Variant consequence	A sequence variant located 3' of a gene
DUST	Algorithm	A standalone application that looks for low complexity sequences.
EMBL (file format)	File formats	EMBL files store sequence and accompanying annotation for features across a genomic region. They can be exported from various webpages in Ensembl and are stored for 1Mb regions across the genome.
EMF Alignment format	File formats	Ensembl Multi Format (EMF) stores genomic alignments in Ensembl.
ENA	INSDC	Europe's primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are submissions from individual researchers, genome sequencing projects and patent applications. https://www.ebi.ac.uk/ena
ENCODE	Epigenome source database	Project aiming to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active, by large scale functional analyses of laboratory cell lines. Used as a source for the Ensembl regulatory build. https://www.encodeproject.org/
Enhancers	Regulatory features	Regions that bind transcription factors and interact with promoters to stimulate transcription of distant genes.
Ensembl canonical	Transcript	A single transcript chosen for a gene which is the most conserved, most highly expressed, has the longest coding sequence and is represented in other key resources, such as NCBI and UniProt. This is defined in detail on http://www.ensembl.org/info/genome/genebuild/canonical.html
Ensembl default (VEP)	File formats	Ensembl default is an input format for the VEP, used to describe the position and alleles of a variant.
Ensembl gene tree pipeline	Algorithm	The process by which Ensembl compare gene sequences in order to construct gene trees and predict homologues.
Ensembl Genebuild	Algorithm	The automatic process by which Ensembl plot known RNA and protein sequence onto the genome, using sequence similarity.
Ensembl Havana	Ensembl Genebuild	Human And Vertebrate ANalysis and Annotation. The team within Ensembl who manually annotate genes and transcripts for a subset of species.
Ensembl Regulatory Build	Algorithm	The process by which Ensembl predict the location of regions that regulate gene expression using epigenomic evidence.
Ensembl sources		Publicly available database that Ensembl imports data from.
Epigenome	Regulatory activity	A cell type, such as a primary tissue or lab cell line, for which we have epigenome evidence and can predict regulatory features.
Epigenome evidence	Regulatory features	Experimental data that is used to construct and determine activity of regulatory features.
Epigenome source database	Ensembl sources	Database from which Ensembl imports ChIP-seq, DNase-seq and other related datasets, which are used in the Ensembl regulatory build.
EPO	Multiple whole genome alignment	The EPO (Enredo, Pecan, Ortheus) pipeline is a three-step pipeline for whole-genome multiple alignments: it defines segments with Enredo, aligns them with Pecan and constructs ancestral sequences with Ortheus.
Eponine	Algorithm	Eponine is a probabilistic method for detecting transcription start sites (TSS) in mammalian genomic sequence, with good specificity and excellent positional accuracy. Eponine models consist of a set of DNA weight matrices recognising specific sequence motifs. Each of these is associated with a position distribution relative to the TSS. http://www.sanger.ac.uk/science/tools/eponine
eQTL	QTL	Genetic loci where allelic variation is associated with expression levels of other genes.
EST	Genome annotation	Expressed Sequence Tag. Coarse sequence reads from flanking vector regions into the inserts of cDNA libraries. ESTs act as physical markers for cloning and full length sequencing of the cDNAs of expressed genes. Typically identified by purifying mRNAs, converting to cDNAs, and then sequencing a portion of the cDNAs. Usually short, single reads from a tissue or stage in development.
EVA	Variation source database	The European Variation Archive is an open-access database of all types of genetic variation data from all species. https://www.ebi.ac.uk/eva/
Evidence status	Variant	Codes that reflect the amount and type of evidence that supports the existence of a variant.
Exon	Transcript	Transcribed genomic region that remains in the RNA after splicing, includes both the CDS and the UTRs.
Exonerate	Algorithm	A fast gapped DNA-DNA alignment algorithm. It can be used for aligning various types of sequences such as genomic DNA, cDNAs/ESTs, and proteins. It is used in the Targetted stage of the Ensembl GeneBuild. https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate
External reference	Genome annotation	Mapping between Ensembl genes, transcripts and proteins to the same features in other databases.
FASTA	File formats	FASTA is used to store finished nucleotide and peptide sequences. The Ensembl FTP site has genome, cDNA, CDS and peptide sequences in FASTA, and you can export FASTA from various webpages in Ensembl.
Feature elongation	Variant consequence	A sequence variant that causes the extension of a genomic feature, with regard to the reference sequence
Feature truncation	Variant consequence	A sequence variant that causes the reduction of a genomic feature, with regard to the reference sequence
Fix patch	Patch	Fix patches are where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence.
Flagged variant	Variant	Variants that failed our quality control analyses, therefore they are flagged as suspicious.
Flanking sequence	Genome annotation	Sequence 5' or 3' to a DNA or RNA sequence of interest (for example gene, transcript, SNP or repeat).
Forward strand	Genome annotation	DNA strand arbitrarily defined as the strand with its 5' end at the tip of the short chromosome arm (p). If a gene is forward-stranded, its sense (sequence matching cDNA) is on the forward strand. The forward strand is the reverse complement of the reverse strand.
Frameshift variant	Variant consequence	A sequence variant which causes a disruption of the translational reading frame, because the number of nucleotides inserted or deleted is not a multiple of three
GenBank (database)	INSDC	The US branch of INSDC. https://www.ncbi.nlm.nih.gov/genbank/
GenBank (file format)	File formats	GenBank files store sequence and accompanying annotation for features across a genomic region. They can be exported from various webpages in Ensembl and are stored for 1Mb regions across the genome.
GENCODE	Gene source database	The aim of GENCODE as a sub-project of the ENCODE scale-up project is to annotate all evidence-based gene features in the entire human and mouse genomes at a high accuracy. The GENCODE gene set is the default geneset in Ensembl and is equivalent to the Ensembl/HAVANA merged genes. https://www.gencodegenes.org/
GENCODE Basic	Transcript	A subset of the GENCODE transcript set, containing only 5' and 3' complete transcripts.
GENCODE Comprehensive	Transcript	The full GENCODE transcript set, containing both complete transcripts and 5' and 3' incomplete transcripts.
Gene	Genome annotation	Genomic locus where transcription occurs. A gene may have one or more transcripts, which may or may not encode proteins.
Gene Ontology	Gene source database	An organised hierarchy of terms produced by the Gene Ontology Consortium, used to describe the function of proteins. GO terms are split into three subcategories: biological processes (what the protein does), cellular component (where in the cell the protein is found), and molecular function (how the protein acts). http://www.geneontology.org/
Gene source database	Ensembl sources	Database from which Ensembl imports cDNA or protein sequence for gene annotation, or gene names.
Gene split	Paralogues	Pairs of genes in a species that occur together in the same tree, but are actually two halves of the same gene split partway along.
Gene tree	Homologues	A representation of the evolutionary relationship between homologues, constructed using the Ensembl gene tree pipeline.
Genetic marker	Variant	A measurable locus that varies within a population.
GeneWise	Algorithm	GeneWise is a sequence analysis tool for comparing proteins to DNA sequences allowing for introns and frameshifts. It is used in the Targetted stage of the Ensembl GeneBuild. https://www.ebi.ac.uk/Tools/psa/genewise/
Genome		The complete set of DNA found in each cell.
Genome annotation		A genomic locus that has been annotated.
Genome assembly	Genome	A computational representation of the sequence of a haploid genome, representative of a species or strain.
Genotype	Allele (variant)	The specific alleles that are present in an individual's genome. In diploid organisms two alleles make up the genotype (except for the sex chromosomes).
GENSCAN	Algorithm	An HMM-based ab initio gene prediction method, used to create a track of ab initio genes in Ensembl. http://genes.mit.edu/GENSCAN.html
GFF	File formats	GFF is a tab-delimited format that describes genomic features, such as genes and transcripts, and allows hierarchical linking of gene features. Ensembl store gene files as GFF, allow attachment of GFF files to view against the genome and allow custom annotation with the VEP using GFF files.
Global MAF	Minor allele frequency	The frequency of the second most common allele in the global population, defined in human by the 1000 Genomes Project phase 3.
gnomAD	Variation source database	An aggregation of publicly available whole genome and whole exome variant calling experiments in human. GnomAD was previously known as ExAC, when it contained only exome data. Ensembl display population frequencies from gnomAD. http://gnomad.broadinstitute.org/
Golden path (genome size)	Genome assembly	The golden path is the length of the non-redundant reference assembly. It excludes alternative sequences and PARs, but includes the estimated size of the gaps.
GTF	File formats	GTF is a tab-delimited format that describes genomic features, such as genes and transcripts, and allows hierarchical linking of gene features. Ensembl store gene files as GTF, allow attachment of GTF files to view against the genome and allow custom annotation with the VEP using GTF files.
GVF	File formats	Genome Variation Format (GVF) is used to store variation data. It can be found on the Ensembl FTP site.
GWAS catalog	Phenotype source database	A curated database that extracts associations between variants and genes from published genome-wide association studies in human. Ensembl display phenotypes from the GWAS catalog. https://www.ebi.ac.uk/gwas/
Haplotype (genome)	Alternative sequence	Known variations to the primary assembly, due to variability in the human genome sequence (eg. the highly variable MHC locus). These were included as part of the genome assembly when it was first produced.
Haplotype (variation)	Linkage disequilibrium	A set of variant alleles in a contiguous genomic region. A haplotype block describes a set of alleles which tend to be inherited together.
HapMap	Variation source database	An international collaboration formed to develop a haplotype map of the human genome and thus describe the common patterns of human DNA sequence variation using genotyping. Ensembl display sample genotypes and population frequencies from the HapMap project. https://www.genome.gov/10001688/international-hapmap-project/
Hard masked	Repeat masking	Hard masked sequence is repeat masked with the repeat sequences replaced by Ns. Hard masked sequence files on the Ensembl FTP site have "rm" in their file name.
HGMD	Variation source database	Project aiming to collate all known (published) gene lesions responsible for human inherited disease. Full HGMD access is restricted to license holders so Ensembl supports the minimal public data release which consists of variant/mutation names and locations. http://www.hgmd.cf.ac.uk/ac/index.php
HGNC	Gene source database	HGNC is responsible for approving unique symbols and names for human loci, including protein coding genes, ncRNA genes and pseudogenes, to allow unambiguous scientific communication. HGNC gene names are used for Ensembl human genes, where available, and for orthologous genes in other species. https://www.genenames.org/
HGVS nomenclature	File formats	A set of recommendations for variant naming. The nomenclature describes the change a variant allele has on a named (genomic, transcript or protein) sequence. Can be used as an input for the VEP and displayed for known variants. http://varnomen.hgvs.org/
High impact variant consequence	Variant impact	The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay.
Highest population MAF	Minor allele frequency	The highest minor allele frequency observed in any population typed for this variant. For human this includes the 1000 Genomes Project, gnomAD and UK10K.
Histone modification	Epigenome evidence	Covalent modifications to the histone proteins that make up the nucleosome, which are known to regulate gene expression.
Homoeologues	Homologues	Pairs of genes in a polyploid genome that underwent (a) hybridisation event(s). The original genes were orthologues in the two (or more) species that hybridised, and now occur in the same species. Since they did not arise through a duplication event, they are not paralogues.
Homologues	Gene	Specific genes that are descended from the same common sequence in an ancestor.
Identity	Alignments	A measure of how similar two alignment sequences are, specifically, what percentage of amino acids or nucleotides are the same in type and position between the two sequences. The value is dependent on which sequence is used as the reference, since it is a percentage of that reference.
IG C gene	IG gene	Constant chain immunoglobulin gene that undergoes somatic recombination before transcription
IG D gene	IG gene	Diversity chain immunoglobulin gene that undergoes somatic recombination before transcription
IG gene	Biotype	Immunoglobulin gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/.
IG J gene	IG gene	Joining chain immunoglobulin gene that undergoes somatic recombination before transcription
IG pseudogene	Pseudogene	Inactivated immunoglobulin gene.
IG V gene	IG gene	Variable chain immunoglobulin gene that undergoes somatic recombination before transcription
IMGT	Gene source database	International ImMunoGeneTics information system. Database of immunoglobulin and T-cell receptor annotation. We collaborate with IMGT on manual annotation of somatically recombined genes. http://www.imgt.org/
IMPC	Phenotype source database	An international scientific endeavour to create and characterise the phenotype of 20,000 knockout mouse strains. Ensembl display phenotypes from the IMPC. http://www.mousephenotype.org/
Inactive	Regulatory activity	When a regulatory feature bears no epigenetic modifications from the ones included in the Regulatory Build.
Incomplete terminal codon variant	Variant consequence	A sequence variant where at least one base of the final codon of an incompletely annotated transcript is changed
Indel	Sequence variant	An insertion and a deletion, affecting two or more nucleotides
Inframe deletion	Variant consequence	An inframe non synonymous variant that deletes bases from the coding sequence
Inframe insertion	Variant consequence	An inframe non synonymous variant that inserts bases into the coding sequence
INSDC	Gene source database	An international consortium between the ENA, GenBank and DDBJ to share submissions of nucleotide sequence. These sequences are used as evidence for annotating Ensembl genes. http://www.insdc.org/
Insertion	Sequence variant	Insertion of one or more nucleotides
Interchromosomal breakpoint	Translocation	A rearrangement breakpoint between two different chromosomes.
Interchromosomal translocation	Translocation	A translocation where the regions involved are from different chromosomes.
Intergenic variant	Variant consequence	A sequence variant located in the intergenic region, between genes
InterProScan	Algorithm	InterPro is an integrated resource for protein families, domains and sites, combining information from several different protein signature databases, including PROSITE, PRINTS, Pfam, Seg, SignalP, Gene3D, SMART, TIGRFAMs, PIR SuperFamilies and SUPERFAMILY. Ensembl run InterProScan on all protein sequences, which uses these protein signatures to identify domains. https://www.ebi.ac.uk/interpro/
Intrachromosomal breakpoint	Translocation	A rearrangement breakpoint within the same chromosome.
Intrachromosomal translocation	Translocation	A translocation where the regions involved are from the same chromosome.
Intron	Transcript	Transcribed genomic region that is removed from the RNA by splicing.
Intron variant	Variant consequence	A transcript variant occurring within an intron
Inversion	Structural variant	A continuous nucleotide sequence is inverted in the same position
Karyotype	Genome assembly	The number of chromosomes of a genome.
LastZ	Pairwise whole genome alignment	LASTZ is a program for aligning DNA sequences in a pairwise manner. Its predecessor is BlastZ.
lincRNA (long intergenic ncRNA)	Long non-coding RNA (lncRNA)	Transcripts from long intergenic non-coding RNA loci with a length >200bp. Requires lack of coding potential and may not be conserved between species.
Linkage disequilibrium	Variant	A measure of how often two variants or specific sequences are inherited together.
Long non-coding RNA (lncRNA)	Processed transcript	A non-coding gene/transcript >200bp in length
Loss of heterozygosity	Structural variant	A functional variant whereby the sequence alteration causes a loss of function of one allele of a gene.
Low complexity regions	Repeat	Poly-purine or poly-pyrimidine stretches, or regions of extremely high AT or GC content.
Low impact variant consequence	Variant impact	A variant that is assumed to be mostly harmless or unlikely to change protein behaviour.
LTRs	Repeat	Long terminal repeats.
Macro lncRNA	Long non-coding RNA (lncRNA)	Unspliced lncRNAs that are several kb in size.
MAF	File formats	Multiple alignment format (MAF) stores genomic alignments.
Major allele	Allele (variant)	The allele which is most frequent in the global population, defined in human by the 1000 Genomes Project. The major allele may be the reference or the alternative allele, and may or may not be the ancestral allele.
MANE	Transcript	The Matched Annotation from NCBI and EMBL-EBI is a collaboration between Ensembl/GENCODE and RefSeq to identify transcripts that match GRCh38 and are 100% identical between RefSeq and Ensembl/GENCODE for 5' UTR, CDS, splicing and 3'UTR.
MANE Plus Clinical	MANE	Transcripts in the MANE Plus Clinical set are additional transcripts per locus necessary to support clinical variant reporting, for example transcripts containing known Pathogenic or Likely Pathogenic clinical variants not reportable using the MANE Select set. Note there may be additional clinically relevant transcripts in the wider RefSeq and Ensembl/GENCODE sets but not yet in MANE.
MANE Select	MANE	The Matched Annotation from NCBI and EMBL-EBI is a collaboration between Ensembl/GENCODE and RefSeq. The MANE Select is a default transcript per human gene that is representative of biology, well-supported, expressed and highly-conserved. This transcript set matches GRCh38 and is 100% identical between RefSeq and Ensembl/GENCODE for 5' UTR, CDS, splicing and 3'UTR.
Many-to-many orthologues	Orthologues	A type of orthologue assigned for a pair of species where multiple orthologues are found in both species, where the duplication events in both species occurred after the speciation event.
Marker	Genome annotation	A short sequence whose placement on the genome is known.
Mature miRNA variant	Variant consequence	A transcript variant located within the sequence of the mature miRNA
MetaLR	Algorithm	A tool for predicting the pathogenicity of single nucleotide variants using a logistic regression based ensemble method.
MGI	Gene source database	MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease. MGI gene names are used for Ensembl mouse genes, where available. http://www.informatics.jax.org/
Microsatellite	Repeat	A region in the genomic sequence containing short tandem repeats of 2-10bp.
Minor allele	Allele (variant)	The allele which is the second most frequent in the global population, defined in human by the 1000 Genomes Project. The minor allele may be the reference or the alternative allele, and may or may not be the ancestral allele.
Minor allele frequency	Minor allele	The frequency of the second most common allele in the specified population.
miRBase	Gene source database	The miRBase database is a searchable database of published miRNA sequences and annotation. These sequences are used as evidence for annotating Ensembl miRNA genes. http://www.mirbase.org/
miRNA	ncRNA	A small RNA (~22bp) that silences the expression of target mRNA.
miscRNA	ncRNA	Miscellaneous RNA. A non-coding RNA that cannot be classified.
Missense variant	Variant consequence	A sequence variant that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved
Mobile element deletion	Structural variant	A deletion of a mobile element when comparing a reference sequence (has mobile element) to an individual sequence (does not have mobile element).
Mobile element insertion	Structural variant	A kind of insertion where the inserted sequence is a mobile element.
Moderate impact variant consequence	Variant impact	A non-disruptive variant that might change protein effectiveness.
Modifier impact variant consequence	Variant impact	Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact.
Multiple whole genome alignment	Whole genome alignment	An alignment between more than two whole genomes of a selected taxon.
MutationAssessor	Algorithm	A tool for assessing the functional impact of single nucleotide variants based on evolutionary conservation of the affected amino acid in protein homologues.
MySQL	File formats	MySQL is a database. All Ensembl data is stored in MySQL relational tables, which can be found on the FTP site and accessed directly by MySQL queries.
NA	Regulatory activity	When there is no available data in the cell type for this regulatory feature.
ncRNA	Processed transcript	A non-coding gene.
Newick	File formats	Newick is a tree format. Ensembl gene trees can be downloaded in Newick and it is used to store Ensembl species trees.
NMD transcript variant	Variant consequence	A variant in a transcript that is the target of NMD
Non coding	Long non-coding RNA (lncRNA)	Transcripts which are known from the literature to not be protein coding.
Non coding transcript exon variant	Variant consequence	A sequence variant that changes non-coding exon sequence in a non-coding transcript
Non coding transcript variant	Variant consequence	A transcript variant of a non coding RNA gene
Non-ATG start	Transcript	A transcript with a non-ATG start codon but which still encodes a methionine since the ribosomal machinery allows non-AUG to translate as methionine in specific cases.
Nonsense Mediated Decay	Biotype	A transcript with a premature stop codon considered likely to be subjected to targeted degradation. Nonsense-Mediated Decay is predicted to be triggered where the in-frame termination codon is found more than 50bp upstream of the final splice junction.
Novel patch	Patch	Novel patches represent new allelic loci. They can usually be considered as similar to haplotypes and are likely to be reclassified as such in the next genome assembly, but not necessarily.
Novel sequence insertion	Structural variant	An insertion the sequence of which cannot be mapped to the reference genome.
OMIA	Phenotype source database	An online database that describes the function and phenotypes associated with animal genes. Ensembl display phenotypes from OMIA. https://www.omia.org/
OMIM	Phenotype source database	An online database that describes the function and phenotypes associated with human genes. Ensembl display phenotypes from OMIM and MIM morbid. https://www.omim.org/
Open chromatin regions	Regulatory features	Regions of spaced out histones, making them accessible to protein interactions.
Orphanet	Phenotype source database	A catalogue of rare disease associations. Ensembl display phenotypes from Orphanet. http://www.orpha.net/
Orthologues	Homologues	Orthologues are genes derived from a common ancestor through vertical descent (or speciation) and can be thought of as the direct evolutionary counterpart.
OrthoXML	File formats	OrthoXML is an XML format to allow the storage and comparison of orthology data. It is used to store Ensembl homologues.
Other paralogues	Paralogues	Paralogues which are very far away from the other members of a paralogue family. They are part of the same super-family, but the precise taxonomic relationship to other members is undefined, as the trees are too large to compute.
Pairwise interactions (WashU)	File formats	Pairwise interactions, such as those derived from Hi-C, can be stored in the WashU format and viewed in Ensembl.
Pairwise whole genome alignment	Whole genome alignment	An alignment between two whole genomes.
PAR	Genome assembly	Pseudoautosomal regions. Small regions of sequence identity located at the tips of the short and the long arms of the X and Y chromosomes where recombination and genetic exchange take place. Genes within the pseudoautosomal regions are not sex linked.
Paralogues	Homologues	Genes (homologues) that have evolved by duplication.
Patch	Alternative sequence	New sequences that have been added to the genome assembly since its release. There are two types: fix and novel patches.
PDB	Protein source database	A repository for 3D biological macromolecular structure data. Ensembl provide links out to the PDB, and use structures to display the locations of variants in proteins. http://www.ebi.ac.uk/pdbe/
Peak	Epigenome evidence	Locus identified from epigenome signal as having high signal, shown as a BigBed across the genome.
Pecan	Multiple whole genome alignment	Pecan is a global multiple sequence alignment program that makes practical the probabilistic consistency methodology for significant numbers of sequences of practically arbitrary length.
Peptide	Transcript	A sequence of amino acids, translated from a CDS.
Phase	Exon	The position of an exon/intron boundary within a codon. A phase of zero means the boundary falls between codons, one means between the first and second base and two means between the second and third base. Exons have a start and end phase, whereas introns have just one phase. A boundary in a non-coding region has a phase of -1.
Phenotype source database	Ensembl sources	Database from which Ensembl imports phenotype associations with genes and/or variants.
PhyloXML	File formats	PhyloXML is an XML language for the analysis, exchange, and storage of phylogenetic trees (or networks) and associated data. It is used to store Ensembl phylogenetic trees.
piRNA	ncRNA	An RNA that interacts with piwi proteins involved in genetic silencing.
Placed scaffold	Scaffold	A scaffold that can be positioned on a chromosome based on genetic mapping information.
Poised	Regulatory activity	When a regulatory feature displays an epigenetic signature with the potential to be activated. It is analogous to a sprinter in the blocks.
Polymorphic pseudogene	Pseudogene	Pseudogene owing to a SNP/indel but in other individuals/haplotypes/strains the gene is translated.
PolyPhen	Algorithm	A tool which predicts if missense variants are likely to affect protein function based on physical and comparative considerations. http://genetics.bwh.harvard.edu/pph2/
Primary assembly	Genome assembly	The underlying genome sequence, without alternative sequence included.
Private allele	Allele (variant)	An allele which has only been identified in one individual or one family. A private allele may be the reference or the alternative allele, and may or may not be the ancestral allele.
Probe	Structural variant	A DNA sequence used experimentally to detect the presence or absence of a complementary nucleic acid.
Processed pseudogene	Pseudogene	Pseudogene that lacks introns and is thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome.
Processed transcript	Biotype	Gene/transcript that doesn't contain an open reading frame (ORF).
Progressive cactus	Multiple whole genome alignment	Progressive-Cactus is a next-generation aligner that stores whole-genome alignments in a graph structure.
Projection build	Ensembl Genebuild	A gene build method used by Ensembl for low coverage genomes, allowing genes to be annotated that span two scaffolds by mapping to the human gene.
Promoter flanking regions	Regulatory features	Transcription factor binding regions that flank promoters.
Promoters	Regulatory features	Regions at the 5' end of genes where transcription factors and RNA polymerase bind to initiate transcription.
Protein altering variant	Variant consequence	A sequence variant which is predicted to change the protein encoded in the coding sequence
Protein coding	Biotype	Gene/transcript that contains an open reading frame (ORF).
Protein coding CDS not defined	Biotype	Alternatively spliced transcript of a protein coding gene for which we cannot define a CDS.
Protein coding LOF	Biotype	Not translated in the reference genome owing to a SNP/DIP but in other individuals/haplotypes/strains the transcript is translated. Replaces the polymorphic_pseudogene transcript biotype.
Protein domain	Peptide	A region of special biological interest within a single protein sequence. However, a domain may also be defined as a region within the three-dimensional structure of a protein that may encompass regions of several distinct protein sequences that accomplishes a specific function. A domain class is a group of domains that share a common set of well-defined properties or characteristics.
Pseudogene	Biotype	A gene that has homology to known protein-coding genes but contains a frameshift and/or stop codon(s) which disrupts the ORF. Thought to have arisen through duplication followed by loss of function.
PSL	File formats	PSL represents alignments and can be viewed in Ensembl.
QTL	Variant	Genetic loci where allelic variation is associated with variation in a quantitative trait (e.g. blood pressure).
r2	Linkage disequilibrium	The correlation between a pair of loci. It varies from 0 (loci are in complete linkage equilibrium) to 1 (loci are in complete linkage disequilibrium and coinherited).
RDF	File formats	Resource Description Framework (RDF) is used as a metadata data model. Ensembl use it to describe links from Ensembl annotations to those annotations in other databases.
Readthrough	Biotype	A readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs).
Reference allele	Allele (variant)	The allele of a variant found in the reference genome currently being studied. The reference allele is not necessarily the major or ancestral allele.
RefSeq	Gene source database	NCBI's Reference Sequences (RefSeq) database is a curated database of GenBank's genomes, mRNAs and proteins. RefSeq attempts to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcripts, and protein products. https://www.ncbi.nlm.nih.gov/refseq/
RefSeq Match	MANE Select	RefSeq transcripts that match 100% across the sequence, exon/intron structure and UTRs as part of the MANE project
Regulatory activity	Regulatory features	The activity state of a regulatory feature in a specific epigenome.
Regulatory features	Genome annotation	Regions that are predicted to regulate the expression of genes, based on the Ensembl regulatory build.
Regulatory region ablation	Variant consequence	A feature ablation whereby the deleted region includes a regulatory region
Regulatory region amplification	Variant consequence	A feature amplification of a region containing a regulatory region
Regulatory region variant	Variant consequence	A sequence variant located within a regulatory region
Repeat masking	Repeat	The method by which repeated sequences and low-complexity regions are hidden, usually before searches by alignment and homology-searching programs.
RepeatMasker	Algorithm	A program that screens DNA sequences for interspersed repeats and low-complexity regions and masks them, usually before searches by alignment and homology-searching programs. http://www.repeatmasker.org/
Repressed	Regulatory activity	When a regulatory feature is epigenetically repressed, having an epigenetic signature that prevents it from being active.
Retained intron	Long non-coding RNA (lncRNA)	An alternatively spliced transcript believed to contain intronic sequence relative to other, coding, transcripts of the same gene.
REVEL	Algorithm	A tool for predicting the pathogenicity of single nucleotide variants using an ensemble method.
Reverse strand	Genome annotation	DNA strand arbitrarily defined as the strand with its 5' end at the tip of the long chromosome arm (q). If a gene is reverse-stranded, its sense (sequence matching cDNA) is on the reverse strand. The reverse strand is the reverse complement of the forward strand.
Rfam	Gene source database	The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs). These sequences are used as evidence for annotating Ensembl non-coding genes. http://rfam.xfam.org/
RNA repeats	Repeat	Non-functional copies of RNA genes which have been reintegrated into the genome with the assistance of a reverse transcriptase.
Roadmap Epigenomics	Epigenome source database	Project aiming to develop publicly available reference epigenome maps from a variety of cell types. http://www.roadmapepigenomics.org/
rRNA	ncRNA	The RNA component of a ribosome.
Satellite repeats	Repeat	Multiple copies of the same base sequence on a DNA sequence. The repeated pattern can vary in length from a single base to several thousand bases long.
Scaffold	Genome assembly	Scaffolds are sets of ordered, oriented contigs, assembled by sequence overlap. They are longer sequences than contigs, but shorter than full chromosomes.
Sense intronic	Long non-coding RNA (lncRNA)	A long non-coding transcript in introns of a coding gene that does not overlap any exons.
Sense overlapping	Long non-coding RNA (lncRNA)	A long non-coding transcript that contains a coding gene in its intron on the same strand.
Sequence variant	Variant	Variant that only affects a small locus
SGD	Gene source database	Canonical database for the molecular biology and genetics of Saccharomyces cerevisiae, source of the annotation seen in Ensembl. https://www.yeastgenome.org/
Short tandem repeat variant	Variant	A variation that expands or contracts a tandem repeat with regard to a reference.
SIFT	Algorithm	A tool which predicts if missense variants are likely to affect protein function based on sequence homology and the physico-chemical similarity between the alternate amino acids. http://sift.bii.a-star.edu.sg/
Signal	Epigenome evidence	A count of the number of NGS reads from an epigenome experiment aligned to a locus, shown as a BigWig across the genome.
Similarity	Alignments	How well one sequence matches another, as calculated by an alignment program from the identical and conserved residues/nucleotides.
Simple repeats	Repeat	Duplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc.
siRNA	ncRNA	A small RNA (20-25bp) that silences the expression of target mRNA through the RNAi pathway.
Slice	Genome assembly	The term "slice" in Ensembl refers to a length of DNA sequence. A slice can be any length, from one base long to the entire length of a chromosome.
snoRNA	ncRNA	Small RNA molecules that are found in the cell nucleolus and are involved in the post-transcriptional modification of other RNAs.
SNP	Sequence variant	Single Nucleotide Polymorphism, substitution of a single nucleotide for another nucleotide
snRNA	ncRNA	Small RNA molecules that are found in the cell nucleus and are involved in the processing of pre messenger RNAs
Soft masked	Repeat masking	Soft masked sequence is repeat masked with the repeat sequences in lower case. Soft masked sequence files on the Ensembl FTP site have "sm" in their file name.
Splice acceptor variant	Variant consequence	A splice variant that changes the 2 base region at the 3' end of an intron
Splice donor variant	Variant consequence	A splice variant that changes the 2 base region at the 5' end of an intron
Splice region variant	Variant consequence	A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron
Start lost	Variant consequence	A codon variant that changes at least one base of the canonical start codon
Stop codon readthrough	Biotype	The coding sequence contains a stop codon that is translated (as supported by experimental evidence), and termination occurs instead at a canonical stop codon further downstream. It is currently unknown which codon is used to replace the translated stop codon, hence it is represented by 'X' in the protein sequence
Stop gained	Variant consequence	A sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened transcript
Stop lost	Variant consequence	A sequence variant where at least one base of the terminator codon (stop) is changed, resulting in an elongated transcript
Stop retained variant	Variant consequence	A sequence variant where at least one base in the terminator codon is changed, but the terminator remains
Structural variant	Variant	Variant that affects a large locus
Substitution	Sequence variant	A sequence alteration where the length of the deleted sequence is the same as the length of the inserted sequence.
SwissProt	UniProt	UniProt/Swiss-Prot is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases. These sequences are used as evidence for annotating Ensembl genes.
Synonymous variant	Variant consequence	A sequence variant where there is no resulting change to the encoded amino acid
Synteny	Whole genome alignment	In a genomic context we refer to syntenic regions if the sequence is globally conserved between two species.
TAGENE	Transcript	Long-read sequence data is computationally processed into non-redundant transcript models which are manually appraised by the Ensembl-Havana annotation team.
Tandem duplication	Structural variant	A duplication consisting of 2 identical adjacent regions.
Tandem repeat	Variant	Two or more adjacent copies of a region (of length greater than 1).
Tandem repeats	Repeat	Duplications of more complex 100-200 base sequences, typically found at the centromeres and telomeres of chromosomes.
TEC (To be Experimentally Confirmed)	Biotype	Regions with EST clusters that have polyA features that could indicate the presence of protein coding genes. These require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies.
TF binding site variant	Variant consequence	A sequence variant located within a transcription factor binding site
TFBS ablation	Variant consequence	A feature ablation whereby the deleted region includes a transcription factor binding site
TFBS amplification	Variant consequence	A feature amplification of a region containing a transcription factor binding site
Toplevel	Genome assembly	The largest continuous sequence for an organism. The official technical definition of toplevel sequences is 'sequence regions in the genome assembly that are not a component of another sequence region'. For example, when a genome is assembled into chromosomes, toplevel sequences will be chromosomes and unplaced scaffolds. If a genome has only been assembled into scaffolds, then toplevel sequences are scaffolds and unplaced contigs.
TOPMed	Variation source database	Whole genome variant calling data from humans worldwide with heart, lung, blood, and sleep disorders. Ensembl display population frequencies from TOPMed. https://www.nhlbi.nih.gov/science/trans-omics-precision-medicine-topmed-program
TR C gene	TR gene	Constant chain T cell receptor gene that undergoes somatic recombination before transcription
TR D gene	TR gene	Diversity chain T cell receptor gene that undergoes somatic recombination before transcription
TR gene	Biotype	T cell receptor gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/.
TR J gene	TR gene	Joining chain T cell receptor gene that undergoes somatic recombination before transcription
TR V gene	TR gene	Variable chain T cell receptor gene that undergoes somatic recombination before transcription
Transcribed pseudogene	Pseudogene	Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'.
Transcript	Gene	A transcript is the operational unit of a gene. In a genomic context, transcripts consist of one or more exons, with adjoining exons being separated by introns. The exons/introns are transcribed and then the introns spliced out. Transcripts may or may not encode a protein
Transcript ablation	Variant consequence	A feature ablation whereby the deleted region includes a transcript feature
Transcript amplification	Variant consequence	A feature amplification of a region containing a transcript
Transcript haplotype	Haplotype (variation)	The transcript sequence derived from one copy of a gene in an individual, based on the phased 1000 Genomes genotype data. CDS and protein sequences are derived from this.
Transcript support level	Transcript	The Transcript Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users, based on the type and quality of the alignments used to annotate the transcript.
Transcription factor	Epigenome evidence	A protein that binds to DNA and controls the rate of transcription.
Transcription factor binding motif	Regulatory features	Short genomic sequence that is known to bind to a particular transcription factor.
Transcription factor binding sites	Regulatory features	Sites which bind transcription factors, for which no other role can be determined as yet.
Translated Blat	Pairwise whole genome alignment	Translated Blat can be used for alignment of the coding regions of genomes only in a pairwise manner.
Translated pseudogene	Pseudogene	Pseudogenes that have mass spec data suggesting that they are also translated. These can be classified into 'Processed' and 'Unprocessed'.
Translocation	Structural variant	A region of nucleotide sequence that has translocated to a new position
TrEMBL	UniProt	UniProt/TrEMBL (Translated EMBL database) is the subset of UniProt containing the computer-annotated protein translations of all coding sequences (CDS) present in the ENA (formerly EMBL-bank) that are not yet incorporated into the UniProt/SwissProt database. These sequences are used as evidence for annotating Ensembl genes.
tRNA	ncRNA	A transfer RNA, which acts as an adaptor molecule for translation of mRNA.
TSL 1	Transcript support level	A transcript where all splice junctions are supported by at least one non-suspect mRNA.
TSL 2	Transcript support level	A transcript where the best supporting mRNA is flagged as suspect or the support is from multiple ESTs
TSL 3	Transcript support level	A transcript where the only support is from a single EST
TSL 4	Transcript support level	A transcript where the best supporting EST is flagged as suspect
TSL 5	Transcript support level	A transcript where no single transcript supports the model structure.
TSL NA	Transcript support level	A transcript that was not analysed for TSL.
Type I Transposons/LINE	Repeat	Long Interspersed Elements. Retrotransposed elements in the genome containing open reading frames encoding (often inactive) reverse transcription machinery.
Type I Transposons/SINE	Repeat	Short Interspersed Elements. Retrotransposed elements less than 500 bp that contain tRNA, snRNA and rRNA, which require other mobile elements to be transposed. Alu elements are a type of SINE.
Type II Transposons	Repeat	Elements that have been transposed and duplicated around the genome by excision and ligation.
UCSC Genome Browser	Gene source database	A genome browser hosted at the University of California Santa Cruz. Ensembl collaborates with UCSC in projects such as GENCODE, CCDS and TSL. https://genome.ucsc.edu/
UK10K	Variation source database	Study comparing exomes of 6000 diseased individuals with 4000 healthy individuals in the UK in order to identify disease-causing variants. Ensembl display population frequencies from the control group. https://www.uk10k.org/
UniProt	Gene source database	Database of protein sequence and functional information, based at European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR). These sequences are used as evidence for annotating Ensembl genes. http://www.uniprot.org/
UniProt Match	UniProt	The UniProt identifier that matches to the Ensembl transcript. This may be a UniProt protein isoform and will have a number suffix, or may just refer to a UniProt entry.
UniSTS	Marker	UniSTS is an NCBI resource for non-redundant Sequence Tagged Sites (STS) markers. For each marker, UniSTS displays the primer sequences, product size, and mapping information, as well as cross references to dbSNP, RHdb, GDB, MGD, etc. The marker report also lists GenBank and RefSeq records that contain the primer sequences determined by ePCR.
Unitary pseudogene	Pseudogene	A species-specific unprocessed pseudogene without a parent gene, as it has an active orthologue in another species.
Unknown repeat	Repeat	Repeats that cannot be classified.
Unplaced scaffold	Scaffold	A scaffold that cannot be positioned on a chromosome.
Unprocessed pseudogene	Pseudogene	Pseudogene that can contain introns, since it is produced by gene duplication.
Untranslated region	Transcript	The region of a coding cDNA which is not translated.
Upstream gene variant	Variant consequence	A sequence variant located 5' of a gene
Variant	Genome annotation	Locus where the sequence differs between individuals of the same species
Variant consequence	Variant	The effect that the variant has on each feature that it overlaps. A variant will have a consequence for each feature that it overlaps.
Variant impact	Variant consequence	A subjective classification of the severity of the variant consequence, based on agreement with SnpEff.
Variation source database	Ensembl sources	Database from which Ensembl imports variation data, including loci, sample genotypes, population frequencies and phenotype associations.
vaultRNA	ncRNA	Short non coding RNA genes that form part of the vault ribonucleoprotein complex.
VCF	File formats	VCF is a standard format for listing genetic variation, which is the output for many variant callers. It can be used as an input for the Ensembl VEP and is used to store and download variation data in Ensembl.
VEP	Algorithm	The Variant Effect Predictor (VEP) is an Ensembl tool that predicts the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.
VEP cache	File formats	A VEP cache contains all the gene and variant data needed to run a VEP query, and can be used to run large queries quickly on your own machine. These can be installed as part of your VEP installation, or downloaded from the FTP site.
Wasabi	Alignments	An application for displaying sequence alignments with custom colour-annotation, which is used by Ensembl to display gene tree and family alignments. http://wasabiapp.org/
Whole genome alignment	Alignments	An alignment carried out using the whole genome sequence.
Wiggle	File formats	Wiggle format expresses scores across genomic loci, requiring fixed size bins for the scores. It can be uploaded to view in Ensembl.
Within species paralogues	Paralogues	Two or more versions of a duplicated gene in a single species. In a gene tree, the genes are separated by a duplication node.
YAC	Clone	Originated from a bacterial plasmid, a YAC contains a yeast centromeric region, a yeast origin of DNA replication, a cluster of unique restriction sites, a selectable marker and a telomere region at the end of each arm. YACs are capable of cloning extremely large segments of DNA (over 1 megabase long) into a host cell, where the DNA is propagated along with the other chromosomes of the yeast cell.
zFIN	Gene source database	An online biological database of information about the zebrafish (Danio rerio). zFIN gene names are used for Ensembl zebrafish genes, where available. https://zfin.org/
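
The "Phase" entry above describes where an exon/intron boundary falls within a codon. As a rough illustration only, the Python sketch below derives start and end phases for a run of exons. The function name boundary_phases and the input (a list of exon lengths) are hypothetical and not part of any Ensembl API. It assumes every exon is fully coding and that the first exon begins exactly at the start codon, so it never produces the -1 phase used for boundaries in non-coding regions.

```python
def boundary_phases(exon_coding_lengths):
    """Return a (start_phase, end_phase) pair for each exon, in transcript order.

    Assumptions (illustrative only): every exon is fully coding and the
    first exon begins exactly at the start codon.
    """
    phases = []
    start_phase = 0  # the first boundary falls between codons
    for length in exon_coding_lengths:
        # The end phase is the number of bases of an unfinished codon
        # left at the end of this exon: 0, 1 or 2.
        end_phase = (start_phase + length) % 3
        phases.append((start_phase, end_phase))
        # The intervening intron has a single phase, which is also
        # the start phase of the next exon.
        start_phase = end_phase
    return phases


print(boundary_phases([100, 50, 33]))
# [(0, 1), (1, 0), (0, 0)]
```

In this example the first exon ends one base into a codon (end phase 1), so the following intron has phase 1 and the second exon starts in phase 1. The second and third exons both end between codons (phase 0).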