Comparative Genomics

Ensembl Compara provides cross-species resources and analyses, at both the sequence level and the gene level. Below is a list of those resources.


Gene-based resources

We automatically integrate the gene annotations produced by the Ensembl genebuild team across all the species.

Phylogenetic trees

We compute phylogenetic trees for the whole set of protein-coding genes with one pipeline, and for ncRNA genes with another. Both produce a set of trees that are visualized and accessed in exactly the same way. From both sets of gene trees, we extract homologues (orthologues and paralogues). We also analyze gene gain and loss events using the CAFE software.

  • Protein trees are constructed using a representative protein for every gene in Ensembl: proteins are clustered using hcluster_sg based on NCBI BLAST+ e-values, and each cluster of proteins is aligned using M-Coffee or Mafft. Finally, TreeBeST is used to produce a gene tree from each multiple alignment, reconciling it with the species tree to call duplication events. More information → Tree statistics →
  • ncRNA trees are constructed using gene families represented in RFAM, for which a specific covariance model is provided. For each gene family, we build several trees using secondary structure alignments with INFERNAL and genomic alignments with PRANK. All the trees are merged into a final tree using TreeBeST. More information → Tree statistics →
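
The reconciliation step mentioned for protein trees can be illustrated with a minimal sketch. It assumes a toy species tree and a binary gene tree written as nested tuples, and it applies the textbook lowest-common-ancestor rule: an internal gene-tree node is called a duplication when it maps to the same species-tree node as one of its children. All names and data here are hypothetical, and this is not TreeBeST's actual implementation.

```python
# Toy species tree (child -> parent). Hypothetical data, for illustration only.
SPECIES_PARENT = {
    "human": "primates",
    "chimp": "primates",
    "primates": "mammals",
    "mouse": "mammals",
    "mammals": None,
}


def ancestors(node):
    """Return the path from a species-tree node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = SPECIES_PARENT[node]
    return path


def lca(a, b):
    """Lowest common ancestor of two species-tree nodes."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def reconcile(node, events):
    """Map a gene-tree node onto the species tree and record its event type.

    A leaf is a (gene_name, species) tuple of strings; an internal node is a
    (left, right) tuple of nodes. Events are appended in post-order.
    """
    if isinstance(node[0], str):  # leaf: (gene_name, species)
        return node[1]
    left_sp = reconcile(node[0], events)
    right_sp = reconcile(node[1], events)
    here = lca(left_sp, right_sp)
    # Duplication if the node maps to the same place as one of its children.
    kind = "duplication" if here in (left_sp, right_sp) else "speciation"
    events.append((here, kind))
    return here


gene_tree = (
    (("HsA", "human"), ("MmA", "mouse")),
    (("HsB", "human"), ("PtB", "chimp")),
)
events = []
root_species = reconcile(gene_tree, events)
print(events)
```

In this toy tree the root is labelled a duplication because both of its subtrees contain a human gene, so the two copies must predate the human/mouse split.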

Ensembl Families

We also extend our gene-based resources to the whole set of Metazoan proteins from UniProtKB (SwissProt and SPTREMBL) in a resource named Families. Briefly, the pipeline builds similarity clusters with MCL on the set of all Ensembl proteins (potentially several per gene) and the above-mentioned set of UniProt proteins. Clusters are then aligned with Mafft. More information →

Stable ID mapping

Ensembl Families and Protein Trees undergo a stable ID mapping step, which makes it possible to track how a tree or a family is updated across releases. Please note that the mapping relates exclusively to the content, and not to the actual conservation of the alignment or tree topology. More information →
ncRNA trees can be naturally mapped across releases using their RFAM identifier.
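
To make the idea of content-based mapping concrete, here is a minimal sketch that carries stable IDs from one release to the next purely by member overlap. It is an assumption-laden illustration, not the Compara pipeline: the function name `map_stable_ids`, the greedy largest-overlap rule, and the `NEW` prefix for freshly minted IDs are all hypothetical.

```python
def map_stable_ids(old_clusters, new_clusters, prefix="NEW"):
    """Assign a stable ID to each new cluster based on shared members only.

    old_clusters: dict mapping stable ID -> set of member names.
    new_clusters: list of sets of member names.
    Returns a list of IDs, one per new cluster, in the same order.
    """
    # Every (overlap size, old ID, new index) pair that shares a member.
    candidates = [
        (len(members & new), old_id, i)
        for old_id, members in old_clusters.items()
        for i, new in enumerate(new_clusters)
        if members & new
    ]
    # Greedy choice: largest overlap first, ties broken deterministically.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    assigned = {}
    used = set()
    for _overlap, old_id, i in candidates:
        if old_id not in used and i not in assigned:
            assigned[i] = old_id
            used.add(old_id)

    # Clusters with no reusable ID receive a new one (hypothetical scheme).
    result = []
    fresh = 0
    for i in range(len(new_clusters)):
        if i not in assigned:
            fresh += 1
            assigned[i] = f"{prefix}{fresh}"
        result.append(assigned[i])
    return result


old = {"FAM1": {"a", "b", "c"}, "FAM2": {"d", "e"}}
new = [{"a", "b", "x"}, {"c"}, {"y", "z"}, {"d", "e", "f"}]
mapped = map_stable_ids(old, new)
print(mapped)
```

Note that nothing about alignment or topology enters the decision: a cluster keeps its ID as long as it retains the largest share of the old members.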



Sequence-based resources

Whole genome alignments

Whole genome alignments, sometimes abbreviated as WGA, are performed either pairwise between two species or across multiple species. Pairwise alignments are based on lastZ-net (although we have not recomputed all the previous BlastZ-net and translated-Blat alignments). Multiple alignments are mainly based on the EPO pipeline (extended to include fragmented genomes) and Mercator/Pecan. More information →

The following additional analyses are applied to the whole-genome alignments:

Ancestral sequences

From the multiple alignments performed with the EPO pipeline, we can predict ancestral sequences for a number of ancestral taxa. More information →

Age of Base (Beta)

In turn, from these ancestral sequences, we estimate when mutation events occurred along the species tree. This experimental track is currently computed only for substitutions along the human lineage. More information →
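
As a rough illustration of dating a substitution on the human lineage, the sketch below walks from the human base back through predicted ancestral bases at one alignment position and reports the branch on which the human state most recently appeared. The function name, the ancestor labels and the input layout are hypothetical, and the real track may be computed differently.

```python
def substitution_branch(human_base, lineage):
    """Locate the most recent substitution leading to the human base.

    lineage: list of (ancestor_name, base) pairs for one alignment position,
    ordered from the youngest ancestor of human to the oldest.
    Returns (ancestor, descendant) naming the branch that carries the
    substitution, or None if the base is unchanged along the lineage.
    """
    descendant = "human"
    for ancestor, base in lineage:
        if base != human_base:
            # The base changed somewhere between this ancestor and the
            # node just below it on the path to human.
            return (ancestor, descendant)
        descendant = ancestor
    return None


# Hypothetical ancestor labels and bases for a single position.
lineage = [
    ("Homo/Pan", "A"),
    ("Hominidae", "A"),
    ("Catarrhini", "G"),
    ("Primates", "G"),
]
branch = substitution_branch("A", lineage)
print(branch)
```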

Conservation scores and constrained elements

On the multiple alignments performed with the EPO and EPO-2X pipelines, we run GERP to compute conservation scores and extract regions that are significantly conserved. More information →
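
The step from per-position scores to conserved regions can be pictured with a deliberately simplified sketch: report every run of consecutive positions whose score stays at or above a cutoff for a minimum length. This is a stand-in for illustration only; GERP uses its own statistical test to decide which regions are significant, and the function name, threshold and minimum length below are assumptions.

```python
def constrained_runs(scores, threshold, min_length):
    """Return half-open (start, end) index ranges of high-scoring runs.

    A run is a maximal stretch of consecutive positions with
    score >= threshold; only runs of at least min_length are kept.
    """
    runs = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        runs.append((start, len(scores)))
    return runs


# Hypothetical per-position conservation scores.
scores = [0.1, 2.5, 3.0, 2.8, -1.0, 2.9, 0.0, 3.1, 3.2, 3.3]
elements = constrained_runs(scores, threshold=2.0, min_length=3)
print(elements)
```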

Syntenies

Finally, we can derive synteny mappings from the pairwise alignments of species whose genome assembly is not too fragmented. More information →



Access

Data can be accessed using the Compara Perl API, BioMart, or comparative genomics pages on the browser. Gene trees can be viewed from any 'Gene' page on the browser, and exported via the control panel and the Jalview plug-in in the pop-ups that appear when clicking on any part of the tree.

The external Java-based tool PhyloWidget can also be used to visualise phylogenetic trees of compara species. An example which includes all the current species for the main Ensembl website has been created by the Compara team:

  • Ensembl species tree (requires Java)
    UCSC and Ensembl have agreed to use the same species tree for the alignments. However, we do not work with exactly the same set of species (you can find more information here). The branch lengths used in Ensembl combine an estimate based on 4D sites with an estimate based on the number of million years since the divergence of two species (the data being taken from the TimeTree database).

Please refer to the Compara FAQs or the Ensembl Helpdesk if you have further questions.