EnsemblEnsembl Home

Stable Identifier Version Increments in Ensembl

Version Assignment

The following is an explanation of the assignment of versions in Ensembl’s stable identifier mapping pipeline and so should be used as a guide for any species assigned an ENS ID prefix. We assume that an old and new entity have been brought together due to their high level of scoring, a function of their similarity at their location, sequence or composition, and therefore a stable ID version increment is applicable. Since exons are the only entity actively mapped all other entities are based on these exon mappings.

We have provided a brief explanation of the rules applied with appropriate examples. Data structures are expressed in JSON.

Exons

Exons have new versions incremented when their underlying genomic sequence has differed between the old and new release. Sequence edits are applied at the transcript and translation level and therefore are not taken into account.

No Change in Location and Sequence: No Increment

{ "id":"old", "start" : 1000, "end" : 2000, "seq" : "ATGG...GTA" }

{ "id":"new", "start" : 1000, "end" : 2000, "seq" : "ATGG...GTA" }

Locations are the same therefore we will not assign a new identifier nor increment the version since both sequences are the same.

Location Change but no Sequence Change: No Increment

{ "id":"old", "start" : 1000, "end" : 2000, "seq" : "ATGG...GTA" }

{ "id":"new", "start" : 1100, "end" : 2100, "seq" : "ATGG...GTA" }

Locations overlap sufficiently, so the two entities are assumed to be the same object. We will not increment the version since sequences are the same.

Location Change and a small Sequence Change: Increment

{ "id":"old", "start" : 1000, "end" : 2000, "seq" : "ATGG...GTA" }

{ "id":"new", "start" : 1100, "end" : 2100, "seq" : "TTGG...GTA" }

Locations overlap sufficiently, so the two entities are assumed to be the same object. Version is incremented due to the starting A changing to a T.

No Location Change and a small Sequence Change: Increment

{ "id":"old", "start" : 1000, "end" : 2000, "seq" : "ATGG...GTA" }

{ "id":"new", "start" : 1000, "end" : 2000, "seq" : "TTGG...GTA" }

No location change so objects are the same but a change in the starting base from A to T means a version increment.

Transcripts

Transcripts are versioned based upon their spliced sequence which does take mRNA edits into account. An MD5 checksum of the spliced sequence is calculated when a minimal representation of a transcript is built. The checksum is employed as a memory conservation measure. When two transcripts are compared and deemed to be equal, a function of the exons they have in common, the checksums are compared. Any difference in these checksums will result in a version increment.

No Change in exons: No Increment

{ 
    "id" : "old_trans",
    "checksum" : "16429b058f6973d32ae95716ec16a1de",
    "exons" : [ 
        {"id":"old1", "start" : 100, "end" : 199, "seq" : "ATG...GTA"}, 
        {"id":"old2", "start" : 300, "end" : 399, "seq" : "ACT...TAA"}
    ]
}

{ 
    "id" : "new_trans",
    "checksum" : "16429b058f6973d32ae95716ec16a1de",
    "exons" : [ 
        {"id":"old1", "start" : 100, "end" : 199, "seq" : "ATG...GTA"}, 
        {"id":"old2", "start" : 300, "end" : 399, "seq" : "ACT...TAA"}
    ]
}

Due to the underlying structure of the exons the two transcripts are assumed to be the same. The MD5 checksums are the same no version increment occurs.

Small Change in exons: Increment

{ 
    "id" : "old_trans",
    "checksum" : "16429b058f6973d32ae95716ec16a1de",
    "exons" : [ 
        {"id":"old1", "start" : 100, "end" : 199, "seq" : "ATG...GTA"}, 
        {"id":"old2", "start" : 300, "end" : 399, "seq" : "ACT...TAA"}
    ]
}

{ 
    "id" : "new_trans",
    "checksum" : "945504179627b5fef135e5158003e421",
    "exons" : [ 
        {"id":"old1", "start" : 100, "end" : 199, "seq" : "ATG...GTA"}, 
        {"id":"old2", "start" : 300, "end" : 399, "seq" : "ACC...TAA"}
    ]
}

Due to the underlying structure of the exons the two transcripts are assumed to be the same apart from a single basepair difference in the second exon (ACC rather than ACT). Checksums do not match therefore we increment the version.

Small Change in exons with Sequence Edit Compensation: No Increment

{ 
    "id" : "old_trans",
    "checksum" : "16429b058f6973d32ae95716ec16a1de",
    "exons" : [ 
        {"id":"old1", "start" : 100, "end" : 199, "seq" : "ATG...GTA"}, 
        {"id":"old2", "start" : 300, "end" : 399, "seq" : "ACT...TAA"}
    ]
}

{ 
    "id" : "new_trans",
    "checksum" : "16429b058f6973d32ae95716ec16a1de",
    "seq_edit" : "302>T"
    "exons" : [ 
        {"id":"old1", "start" : 100, "end" : 199, "seq" : "ATG...GTA"}, 
        {"id":"old2", "start" : 300, "end" : 399, "seq" : "ACC...TAA"}
    ]
}

Due to the underlying structure of the exons the two transcripts are assumed to be the same apart from a single base pair difference in the second exon (ACC rather than ACT). A sequence edit will change this back to a T therefore no change in the MD5 checksum and no version increment.

Translations

Translations work in a similar way to transcripts except they use the full length amino acid sequence rather than a MD5 digest of the sequence. Once again the sequence is subject to mRNA edits and post-translational edits such as selenocysteine. A change in the transcript version which produces a translation does not mean an increment in the version of the peptide if the transcript change was synonymous e.g. AAT and AAC encode for Asparagine. Translations in Ensembl annotated species have a 1:1 relationship to a transcript therefore we will not consider the issue of how these entities are assumed to be similar.

No Change in Spliced Sequence: No Increment

{ 
    "id" : "old_translation",
    "seq" : "MVTK"
}

{ 
    "id" : "new_translation",
    "seq" : "MVTK",
}

Since there is no change in the amino acid sequence the version stays the same.

Synonymous Change in the Spliced Sequence: No Increment

{ 
    "id" : "old_translation",
    "seq" : "MVTK",
    "spliced_seq" : "ATGGTCACAAAG"
}

{ 
    "id" : "new_translation",
    "seq" : "MVTK",
    "spliced_seq" : "ATGGTCACAAAA"
}

The final base was changed from a G to an A. However both the codons AAG and AAA code for Lysine so there is no version change.

Non-Synonymous Change in the Spliced Sequence: Increment

{ 
    "id" : "old_translation",
    "seq" : "MVTK",
    "spliced_seq" : "ATGGTCACAAAG"
}

{ 
    "id" : "new_translation",
    "seq" : "MVTN",
    "spliced_seq" : "ATGGTCACAAAC"
}

The final base was changed from a G to a C. Codons AAG and AAC produce Lysine and Asparagine respectively so we increment the version.

Non-Synonymous Change in the Spliced Sequence but Compensated: No Increment

{ 
    "id" : "old_translation",
    "seq" : "MVTK",
    "spliced_seq" : "ATGGTCACAAAG"
}

{ 
    "id" : "new_translation",
    "seq" : "MVTK",
    "seq_edit" : "4>K",
    "spliced_seq" : "ATGGTCACAAAC"
}

The final base was changed from a G to a C. Codons AAG and AAC produce Lysine and Asparagine respectively but an amino acid sequence edit reverts the sequence back to MVTK therefore no version change.

Genes

Genes have no splice model available and therefore nothing to directly version against. Instead a gene’s version relates to its ability to retain all transcript models, including versions, remaining stable. The stable identifiers are sorted lexicographically and then concatenated together meaning we treat the grouping of transcripts as a Set). This makes genes more susceptible to version increments but resistant to random changes in transcript ordering.

Gene with no Coding Model Changes: No Increment

{
    "id" : "old",
    "transcripts" : [ { "stable_id" : "ENS1", "version" : 1}, { "stable_id" : "ENS2", "version" : 1} ]
}

{
    "id" : "new",
    "transcripts" : [ { "stable_id" : "ENS1", "version" : 1}, { "stable_id" : "ENS2", "version" : 1} ]
}

The transcripts held by the two genes are the same; this results in no version increment.

Gene with a Coding Model Version Change: Increment

{
    "id" : "old",
    "transcripts" : [ { "stable_id" : "ENS1", "version" : 1}, { "stable_id" : "ENS2", "version" : 1} ]
}

{
    "id" : "new",
    "transcripts" : [ { "stable_id" : "ENS1", "version" : 1}, { "stable_id" : "ENS2", "version" : 2} ]
}

Here the transcript ENS2 has had a version increment. This will result in a version increment for the gene.

Gene with a Coding Model ID Change: Increment

{
    "id" : "old",
    "transcripts" : [ { "stable_id" : "ENS1", "version" : 1}, { "stable_id" : "ENS2", "version" : 1} ]
}

{
    "id" : "new",
    "transcripts" : [ { "stable_id" : "ENS3", "version" : 1}, { "stable_id" : "ENS2", "version" : 1} ]
}

Here the first transcript for this gene has changed its stable ID from ENS1 to ENS3. This will result in a version increment for the gene as we no longer have the transcript ENS1.