Ensembl Gene Set
Gene annotation provided by Ensembl includes both automatic annotation, i.e. genome-wide determination of transcripts, and manual curation, i.e. reviewed determination of transcripts on a case-by-case basis (limited to some species, such as human and mouse). Furthermore, Ensembl imports annotation from FlyBase, WormBase and SGD. Ensembl transcripts displayed on our website are products of the Ensembl automatic pipeline, termed the Ensembl genebuild. All Ensembl transcripts are based on experimental evidence, and thus the automated pipeline relies on the mRNA and protein sequences deposited into public databases by the scientific community. Manually curated transcripts produced by the HAVANA group at the Wellcome Trust Sanger Institute are displayed in a separate track on our website.
An Ensembl gene (with a unique ENSG... ID) includes any spliced transcripts (ENST...) with overlapping coding sequence. Transcripts from the Ensembl genebuild (see below), the Havana/Vega set and the Consensus Coding Sequence (CCDS) set may all be clustered into the same gene. Transcripts that belong to the same gene ID may differ in splice events and exons, and can give rise to very different proteins. These are isoforms, arising from alternative splicing. Transcript clusters with no overlapping coding sequence are annotated as separate genes. Two transcripts may overlap in non-coding sequence (i.e. intronic sequence or untranslated region (UTR)) and still be classified under two separate genes. After the Ensembl gene and transcript sequences are defined, the gene and transcript names (see below) are assigned.
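The clustering rule described above can be illustrated with a small sketch. This is not the Ensembl pipeline code; the Transcript class and the function names are hypothetical, coordinates are assumed to be inclusive and on a single chromosome, and the requirement that overlapping transcripts share a strand is an assumption added for the illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transcript:
    # Hypothetical minimal model: only coding intervals are stored,
    # so UTR and intronic sequence can never trigger clustering.
    name: str
    strand: int
    cds: List[Tuple[int, int]]  # inclusive (start, end) coding intervals


def cds_overlap(a: Transcript, b: Transcript) -> bool:
    """True if any coding interval of a overlaps any coding interval of b."""
    if a.strand != b.strand:  # assumption: genes are strand-specific
        return False
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a.cds for s2, e2 in b.cds)


def cluster_into_genes(transcripts: List[Transcript]) -> List[List[Transcript]]:
    """Group transcripts so that coding overlap (directly or via a chain) means one gene."""
    genes: List[List[Transcript]] = []
    for t in transcripts:
        # Every existing cluster that t touches is folded into one new cluster.
        hits = [g for g in genes if any(cds_overlap(t, m) for m in g)]
        merged = [t]
        for g in hits:
            merged.extend(g)
        genes = [g for g in genes if not any(g is h for h in hits)]
        genes.append(merged)
    return genes


if __name__ == "__main__":
    demo = [
        Transcript("t1", 1, [(100, 200), (300, 400)]),
        Transcript("t2", 1, [(350, 450)]),   # shares coding sequence with t1
        Transcript("t3", 1, [(500, 600)]),   # no coding overlap: separate gene
    ]
    for gene in cluster_into_genes(demo):
        print(sorted(t.name for t in gene))
```

Because only coding intervals are compared, two transcripts that overlap solely in UTR or intron end up in different clusters, as in the rule stated above.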
Standard Ensembl genebuilds are performed on high-coverage genomes and usually take at least four months to complete. While the genebuild process is tailored to each species according to the data that are available, the main steps followed in a standard genebuild are the same for every species.
The initial set-up for a genebuild involves loading the assembly into an Ensembl schema database and then running several analyses across the genome, including masking of repeats and ab initio gene predictions. The ab initio gene predictions are run for website display purposes only and are not used as supporting evidence for the gene structures displayed on our website.
The first stage of the genebuild is known as the Targetted stage. Here, species-specific proteins are aligned to the genome and Genewise is used to build a transcript structure for the protein on the genome.
The Targetted stage is followed by the Similarity stage, in which proteins from closely related species are used to build transcript structures in regions where a Targetted transcript structure is absent. For species with many experimentally generated protein sequences, the Targetted stage tends to contribute most of the gene structures in the genebuild. However, for species with fewer species-specific protein sequences, the Similarity stage plays a much more important role in predicting gene structures.
The next stage in the genebuild is to align species-specific cDNA and EST sequences to the genome. Where cDNA alignments overlap transcripts predicted in the preceding stages, any non-translated region from the cDNA is spliced onto the transcript prediction as UTR. EST alignments are displayed on our website but are usually not used as supporting evidence in the genebuild.
The final set of gene predictions is obtained by merging identical transcripts built from different protein sequences to produce multi-transcript gene predictions, each with a non-redundant set of transcript models. For every transcript model, the protein and mRNA sequences used to predict the model are viewable in the browser as 'supporting evidence'. These can also be obtained using the Perl API.
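As a rough illustration of the merging step, the sketch below collapses transcript models with identical exon structures into one model and keeps every contributing sequence as supporting evidence. The function name, the input layout and the evidence identifiers are hypothetical; the real genebuild code is not shown in this document and may compare models differently.

```python
from collections import defaultdict


def merge_identical_models(models):
    """Collapse models with the same exon structure.

    models: iterable of (exons, evidence_id) pairs, where exons is a
    sequence of (start, end) pairs. Returns a dict mapping each distinct
    exon structure to the list of evidence ids that produced it.
    """
    evidence_by_structure = defaultdict(list)
    for exons, evidence_id in models:
        # Sort exons so that input order does not affect identity.
        key = tuple(sorted(tuple(exon) for exon in exons))
        if evidence_id not in evidence_by_structure[key]:
            evidence_by_structure[key].append(evidence_id)
    return dict(evidence_by_structure)


if __name__ == "__main__":
    built = [
        ([(1, 10), (20, 30)], "proteinA"),  # hypothetical evidence ids
        ([(20, 30), (1, 10)], "proteinB"),  # same structure, other protein
        ([(1, 10), (25, 30)], "proteinC"),  # different second exon
    ]
    for structure, evidence in merge_identical_models(built).items():
        print(structure, evidence)
```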
Once a 'final' gene set has been obtained, a number of post-processing procedures are applied to filter and annotate the predicted genes. These include protein domain annotation, ncRNA annotation, pseudogene prediction and cross-referencing to external databases.
The human and mouse genebuilds involve additional steps not included in the standard Ensembl build. For both species, transcripts from the Consensus Coding Sequence (CCDS) set are imported directly and not altered by the genebuild process. In addition, where manual curation is available for a transcript, the Ensembl and HAVANA transcript models are compared. The Ensembl and HAVANA models are merged when they agree on the same coding sequence. Merged models will be coloured gold in the browser. A merged, or golden, gene indicates one or more common transcripts between Ensembl and HAVANA. This combined geneset is the default gene set from the GENCODE project.
The genebuild process for low-coverage genomes does not follow that of the standard Ensembl genebuild described above.
Sources of Supporting Evidence for a Transcript
Ensembl transcripts are based on mRNA and protein from the following databases. If you do not find a transcript that you expected in Ensembl, make sure there is sequence submitted into these databases. If the sequence is missing, consider submitting your experimental evidence to EMBL-Bank.
- EMBL Nucleotide Sequence Database is the European aspect of the International Nucleotide Sequence Database Consortium, INSDC, and is maintained at the European Bioinformatics Institute, the EBI, near Cambridge, UK. All sequence records are synchronised with the other partners in the INSDC, namely NCBI GenBank in North America and DDBJ in Japan.
- The UniProtKB houses a collection of reviewed, manually annotated protein sequences, UniProtKB/Swiss-Prot, and a set of unreviewed proteins translated from the EMBL/GenBank/DDBJ nucleotide set, UniProtKB/TrEMBL.
- NCBI RefSeq aims to provide a comprehensive set of mRNA and proteins. Manually annotated, reviewed proteins have an ID beginning with NP, known protein, while mRNAs in this category begin with NM. Predicted proteins and mRNA transcripts are not used in the Ensembl Genebuild, and begin with XP and XM, respectively.
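The RefSeq prefix convention in the last item can be checked mechanically. The following is a minimal sketch; the function name and return layout are hypothetical, and only the four prefixes named above are handled (any other prefix returns None rather than a guess). The accession strings in the example are used purely as formatted examples.

```python
def refseq_category(accession):
    """Classify a RefSeq accession by its prefix.

    Returns (molecule, used_in_genebuild) for NM/NP/XM/XP accessions,
    or None for any prefix not described in the text above.
    """
    prefix = accession.strip().split("_", 1)[0].upper()
    table = {
        "NM": ("mRNA", True),      # reviewed mRNA
        "NP": ("protein", True),   # reviewed protein
        "XM": ("mRNA", False),     # predicted mRNA, not used
        "XP": ("protein", False),  # predicted protein, not used
    }
    return table.get(prefix)


if __name__ == "__main__":
    for acc in ["NM_000546.6", "NP_000537.3", "XM_000001.1", "XP_000001.1"]:
        print(acc, refseq_category(acc))
```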
Gene Names and External References
Most human genes have an associated HGNC symbol from the HUGO Gene Nomenclature Committee. If the gene or transcript comes from the Ensembl Genebuild's automated pipeline only, the HGNC symbol will have HGNC automatic associated with it. If it is a manually curated transcript from Havana/Vega, it will have HGNC curated associated with it.
'Clone-based' identifiers apply to transcripts that cannot be associated with an HGNC symbol and are either assigned by Ensembl or Havana, as above.
The list of gene name categories for human is as follows:
- HGNC automatic
- HGNC curated
- Clone-based ensembl
- Clone-based vega
A new number follows each transcript name. If the number starts with '0', e.g. 001, 002, etc., this is a merged or manually curated transcript from Havana/Vega. If the number starts with '2', such as 200, 201,..., this is an automatically annotated transcript from Ensembl. This does not take the place of the Ensembl gene ID, ENSG..., which is stable from release to release, and has not changed. The goal is to be more consistent with the naming of our manually curated imports from Havana.
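The numbering convention can be read off a transcript name with a few lines of code. This sketch assumes the number is separated from the name by a hyphen (for example a name of the form SYMBOL-001); that separator, the function name and the returned labels are assumptions made for illustration, not part of the convention as stated above.

```python
def transcript_source(name):
    """Infer the annotation source from the number that follows a transcript name."""
    if "-" not in name:
        return "unknown"
    number = name.rsplit("-", 1)[1]  # assumption: hyphen-separated suffix
    if not number.isdigit():
        return "unknown"
    if number.startswith("0"):
        return "havana_or_merged"    # e.g. 001, 002
    if number.startswith("2"):
        return "ensembl_automatic"   # e.g. 200, 201
    return "unknown"


if __name__ == "__main__":
    for example in ["GENE-001", "GENE-201", "GENE"]:  # hypothetical names
        print(example, transcript_source(example))
```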
Ensembl genes and transcripts are classified as known, novel or merged.
A known gene or transcript matches to a sequence in a public, scientific database such as UniProtKB and NCBI RefSeq. The match must be for the same species, otherwise the gene is considered to be novel for that species. Ensembl-Havana merges occur when a manually annotated transcript from the Havana project matches to the Ensembl coding sequence exactly. Ensembl-Havana merged transcripts are limited to human and mouse. Matches to IDs in other databases, such as UniProtKB and NCBI EntrezGene, are listed under External References in the Transcript tab.
Known miRNAs must have a match in miRBase for the same species.
Ig segments, non-coding RNA and transient EST gene models are also automatically annotated by Ensembl. The genes based on ESTs, or Expressed Sequence Tags, can be a good starting point to determine new isoforms. However, these models are not as robust as the known and novel protein-coding set, as ESTs have a high degree of error in their sequence. This may not be the case for some human sets, but in the majority, the mRNA and protein sequences used in the main gene set determination are more reliable. EST genes have 'EST' in the identifier, such as 'ENSESTG' for a human Ensembl EST gene. They can also be used to identify the presence of mRNA, as a confirmation that a transcript is expressed.
Transcripts are scanned for pseudogene behaviour or false coding sequence. A transcript is considered a pseudogene if any one of these four criteria is met:
- It is a single exon transcript and it matches a multi-exon transcript elsewhere in the genome.
- The transcript is completely masked out by repeat masker.
- The transcript contains no introns and multiple frameshifts.
- The transcript contains frameshifts and all of the introns are >80% covered by repeats.
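The four criteria above can be expressed as a single predicate. This is a simplified sketch, not the Ensembl implementation: the TranscriptFeatures fields are hypothetical precomputed inputs, "single exon" and "no introns" are both modelled as an empty intron list, "multiple frameshifts" is taken to mean two or more, and the fourth criterion is assumed to require at least one intron.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptFeatures:
    # Hypothetical, precomputed summary of one transcript.
    matches_multi_exon_elsewhere: bool = False
    fully_repeat_masked: bool = False
    frameshift_count: int = 0
    # Fraction (0.0 to 1.0) of each intron covered by repeats.
    intron_repeat_coverage: List[float] = field(default_factory=list)


def is_pseudogene(t: TranscriptFeatures) -> bool:
    """Return True if any one of the four criteria listed above is met."""
    no_introns = len(t.intron_repeat_coverage) == 0
    # 1. Single-exon transcript matching a multi-exon transcript elsewhere.
    if no_introns and t.matches_multi_exon_elsewhere:
        return True
    # 2. Transcript completely masked out by the repeat masker.
    if t.fully_repeat_masked:
        return True
    # 3. No introns and multiple (assumed: two or more) frameshifts.
    if no_introns and t.frameshift_count >= 2:
        return True
    # 4. Frameshifts present and every intron more than 80% repeat-covered.
    if (t.frameshift_count >= 1 and not no_introns
            and all(c > 0.8 for c in t.intron_repeat_coverage)):
        return True
    return False


if __name__ == "__main__":
    candidate = TranscriptFeatures(frameshift_count=1,
                                   intron_repeat_coverage=[0.9, 0.85])
    print(is_pseudogene(candidate))
```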