The orthology QC is generated by two pipelines that focus on different levels of genetic difference: nucleotide-level mutations and larger-scale rearrangements. Both approaches are completely independent of the orthology inference itself and are used as external evidence to confirm the predicted orthologies.
Gene Order Conservation score
The pipeline that focuses on large-scale rearrangements generates the gene order conservation (GOC) scores. It assumes that rearrangements are more likely to affect a group of contiguous genes than genes in isolation, so we expect the gene neighbourhoods surrounding each gene of an orthologous pair to be well conserved.
The GOC pipeline can be divided into the following steps:
- Load all the predicted orthologs for a pair of species
- Separate the orthologs into their respective chromosomes
- Discard any ortholog that is by itself on its chromosome (usually on a scaffold), as these orthologs automatically get a NULL score for having no neighbours
- Order the set of orthologs in each chromosome by their start positions on the chosen reference genome
- For each orthologous pair, fetch the two genes upstream and the two genes downstream
- Check whether they are also identified as orthologues and in the same orientation
- Each match is scored as 25%, meaning that if all four neighbouring genes match, that ortholog gets a GOC score of 100% for this reference genome
- Go back to step 4 (ordering the orthologs) and repeat, using the other species as the reference genome
- Each orthologous pair now has 2 GOC scores, one per reference genome. We currently report the maximum of these scores
Of the 4 neighbouring genes, 3 are orthologues and in conserved order and position, resulting in a GOC score of 75.
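The GOC steps above can be sketched in Python as follows. This is a minimal illustration, not the production pipeline: the `Gene` class and function names are hypothetical, orthologs are assumed to be one-to-one, and the rule used to decide that a neighbour "matches" (its partner sits at the corresponding rank on the same chromosome of the other genome, with the same relative orientation as the focal pair) is one possible reading of the description.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:  # hypothetical minimal gene record
    gene_id: str
    chrom: str
    start: int
    strand: int  # +1 or -1


def _ranks(pairs, side):
    """Group pair indices by chromosome on one genome, sorted by start."""
    by_chrom = defaultdict(list)
    for idx, pair in enumerate(pairs):
        by_chrom[pair[side].chrom].append(idx)
    for idxs in by_chrom.values():
        idxs.sort(key=lambda i: pairs[i][side].start)
    return by_chrom


def goc_one_direction(pairs, ref):
    """GOC score per pair, using genome `ref` (0 or 1) as the reference."""
    other = 1 - ref
    other_rank = {}
    for idxs in _ranks(pairs, other).values():
        for rank, i in enumerate(idxs):
            other_rank[i] = rank

    scores = {}
    for idxs in _ranks(pairs, ref).values():
        if len(idxs) < 2:
            scores[idxs[0]] = None  # lone ortholog: no neighbours, NULL score
            continue
        for pos, i in enumerate(idxs):
            # +1 if the focal pair is collinear, -1 if it is inverted
            direction = pairs[i][ref].strand * pairs[i][other].strand
            score = 0
            for k in (-2, -1, 1, 2):  # two upstream, two downstream
                npos = pos + k
                if not 0 <= npos < len(idxs):
                    continue
                n = idxs[npos]
                same_chrom = pairs[n][other].chrom == pairs[i][other].chrom
                same_orient = pairs[n][ref].strand * pairs[n][other].strand == direction
                in_place = other_rank[n] == other_rank[i] + k * direction
                if same_chrom and same_orient and in_place:
                    score += 25  # each matching neighbour is worth 25%
            scores[i] = score
    return scores


def goc(pairs):
    """Report the maximum of the two per-reference GOC scores."""
    first, second = goc_one_direction(pairs, 0), goc_one_direction(pairs, 1)
    result = []
    for i in range(len(pairs)):
        available = [s for s in (first[i], second[i]) if s is not None]
        result.append(max(available) if available else None)
    return result


# Five collinear orthologs in species A; in species B the first one
# has moved to another chromosome.
pairs = [
    (Gene("a0", "chr1", 100, 1), Gene("b0", "chr2", 100, 1)),
    (Gene("a1", "chr1", 200, 1), Gene("b1", "chr1", 200, 1)),
    (Gene("a2", "chr1", 300, 1), Gene("b2", "chr1", 300, 1)),
    (Gene("a3", "chr1", 400, 1), Gene("b3", "chr1", 400, 1)),
    (Gene("a4", "chr1", 500, 1), Gene("b4", "chr1", 500, 1)),
]
print(goc(pairs)[2])  # 75: three of the four neighbours are conserved
```

In this toy data the middle pair has three conserved neighbours out of four, which gives the same score of 75 as the example described above.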
Whole Genome Alignment score
The pipeline that focuses on nucleotide-level mutations generates the whole genome alignment (WGA) score. It assumes that high-quality “true” orthologs should be well aligned to each other. This approach is based on nucleotide-level differences, taking advantage of the wealth of alignment information available through EnsEMBL’s comparative genetics analyses.
The main steps of this WGA pipeline are:
- First, as the coverage over exonic regions is taken into account by this pipeline, exon boundaries are fetched for all genes in all species of interest. These are stored in a local table for efficiency gains later
- Next, the species are paired off and all alignments between each pair are detected. All predicted orthologs between the pair are fetched and batched (default batch size = 10)
- The coverage over each member of the orthology is calculated using every available alignment. Coverage over exons is regarded as more important than coverage over intronic regions, so a weighted score is generated. The score takes the coverage over exons as a base, with bonus points given for coverage over the introns (normalized by the proportion of intronic sequence in the gene). The scores for every alignment over both genes are stored in an intermediate table
- Finally, an overall score for the homology prediction as a whole is computed. For each alignment, the scores of the two genes are averaged, and the maximum of these averages is reported, i.e. we report the average score for the greatest-coverage alignment
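The final aggregation step can be illustrated with a short Python sketch. It takes the per-gene weighted scores as given (the exact exon/intron weighting is not reproduced here), and the function name, input layout, and alignment labels are hypothetical.

```python
def wga_score(per_alignment_scores):
    """Overall WGA score for one orthologous pair.

    per_alignment_scores maps an alignment label to a tuple
    (score_gene_1, score_gene_2) of per-gene weighted coverage scores.
    Returns None when no alignment covers the pair.
    """
    if not per_alignment_scores:
        return None
    # Average the two genes within each alignment, then keep the best alignment.
    return max((g1 + g2) / 2 for g1, g2 in per_alignment_scores.values())


scores = {
    "alignment_1": (90.0, 70.0),  # average 80.0
    "alignment_2": (60.0, 95.0),  # average 77.5
}
print(wga_score(scores))  # 80.0
```

Averaging before taking the maximum means that a single alignment must cover both genes well; a high score for one gene in one alignment cannot be combined with a high score for the other gene in a different alignment.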
We have defined stringent thresholds for several taxonomic groups to flag orthologies as high-confidence. For each pair of species, we select the thresholds corresponding to their most recent common ancestor. To be flagged as high-confidence, a pair of orthologues must have a sufficient %identity, and have a GOC score or WGA coverage greater than or equal to the threshold. If neither of the latter two metrics is available for this pair of species, we fall back to the tree-compliance metric explained in this document.
|Clades|Min. GOC score|Min. WGA score|Min. %identity|
|---|---|---|---|
|Mammalia, Aves, Percomorpha|75|75|50|
|Others|No threshold used|No threshold used|25|
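The decision rule and thresholds above can be sketched in Python as follows. This is a hedged illustration with hypothetical names, and it relies on three assumptions: the clade passed in is that of the pair's most recent common ancestor, `identity` is a single %identity value for the pair, and the tree-compliance metric (described elsewhere) is supplied by the caller as a boolean. It also assumes that "No threshold used" means only %identity is checked.

```python
# (min GOC, min WGA, min %identity); None means "No threshold used"
THRESHOLDS = {
    "Mammalia": (75, 75, 50),
    "Aves": (75, 75, 50),
    "Percomorpha": (75, 75, 50),
}
OTHERS = (None, None, 25)


def is_high_confidence(clade, identity, goc=None, wga=None, tree_compliant=None):
    """Flag an orthologous pair as high-confidence (illustrative sketch).

    goc / wga are None when the metric is unavailable for the species pair.
    """
    min_goc, min_wga, min_identity = THRESHOLDS.get(clade, OTHERS)
    if identity < min_identity:
        return False
    if min_goc is None and min_wga is None:
        # Assumption: without GOC/WGA thresholds, %identity alone decides.
        return True
    if goc is None and wga is None:
        # Neither metric is available: fall back to tree compliance.
        return bool(tree_compliant)
    goc_ok = goc is not None and goc >= min_goc
    wga_ok = wga is not None and wga >= min_wga
    return goc_ok or wga_ok


print(is_high_confidence("Mammalia", 60, goc=75))           # True
print(is_high_confidence("Mammalia", 60, goc=50, wga=50))   # False
print(is_high_confidence("Aves", 80, tree_compliant=True))  # True
```

Any clade name not listed in `THRESHOLDS` is treated as "Others" in this sketch.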