Deprecated

Compara gene and protein families were retired as of release 102, as announced in a November 2020 Ensembl blog post.

Gene Families in Compara

Ensembl families are determined by classifying all Ensembl proteins, including multiple isoforms of the same gene, together with metazoan sequences from UniProt. They therefore provide a way of exploring orthologues and closely related homologues across a range of animal species.

Pipeline

The pipeline consists of the following steps:

  1. Load proteins from Ensembl and UniProt.
  2. Run an HMM search against the TreeFam HMM library to classify the sequences into families.
  3. Align the members of each family with MAFFT.
  4. Annotate the family with a consensus description, based on its members' descriptions.
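
The classification in step 2 can be pictured as a best-hit assignment: each sequence joins the family of the HMM it scores highest against. The following is only a minimal sketch under that assumption, not the Compara pipeline's own code. The `Hit` record, the `min_score` cut-off, the function name and all identifiers in the example data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical hit record: (sequence_id, hmm_id, bit_score).
Hit = Tuple[str, str, float]


def classify_by_best_hit(
    hits: Iterable[Hit], min_score: float = 0.0
) -> Dict[str, List[str]]:
    """Group sequences into families keyed by their best-scoring HMM."""
    best: Dict[str, Tuple[str, float]] = {}
    for seq_id, hmm_id, score in hits:
        if score < min_score:
            continue  # ignore hits below the (assumed) cut-off
        # Keep only the highest-scoring HMM seen so far for this sequence.
        if seq_id not in best or score > best[seq_id][1]:
            best[seq_id] = (hmm_id, score)

    families: Dict[str, List[str]] = defaultdict(list)
    for seq_id, (hmm_id, _score) in best.items():
        families[hmm_id].append(seq_id)
    return {hmm_id: sorted(members) for hmm_id, members in families.items()}


# Made-up example data: two Ensembl proteins and one UniProt entry.
hits = [
    ("ENSP00000000001", "TF100001", 250.0),
    ("ENSP00000000001", "TF100002", 80.5),
    ("P12345", "TF100001", 190.2),
    ("ENSP00000000002", "TF100002", 310.7),
]
families = classify_by_best_hit(hits)
```

Keying the result by HMM identifier mirrors the statement that families take the stable ID of their corresponding HMM.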

Family ID

The families have been assigned the stable ID of their corresponding HMM.

Consensus Annotation

For each cluster obtained, a consensus annotation is automatically generated from the UniProt description lines using the following approach:

  • If the description covers fewer than 40% of the UniProt members in the cluster, the family description is set to 'AMBIGUOUS'.
  • If the annotation confidence score, described below, is zero, the family description is set to 'UNKNOWN'.
  • Note that 'UNCHARACTERIZED' is a description taken from UniProt itself for a protein; it does not reflect the score.

The annotation confidence score is the percentage of UniProt family members with this description, or part of it. Note that only family members with 'informative' UniProt descriptions are taken into account.
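
The rules above can be sketched in a few lines of Python. This is a simplified illustration, not the pipeline's implementation: it counts only exact matches of a description (the real score also credits partial matches), it assumes the percentage is taken over the informative members, and the `is_informative` filter is a hypothetical stand-in because the actual criteria for an 'informative' description are not given here.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

AMBIGUOUS_THRESHOLD = 40.0  # percent, taken from the rule in the text


def is_informative(description: str) -> bool:
    """Hypothetical filter: treat any non-empty description as informative."""
    return bool(description.strip())


def consensus_annotation(
    descriptions: Iterable[str],
    informative: Callable[[str], bool] = is_informative,
) -> Tuple[str, float]:
    """Return (family description, confidence score in percent)."""
    usable = [d.strip() for d in descriptions if informative(d)]
    if not usable:
        # No informative description at all: the score is zero.
        return "UNKNOWN", 0.0

    # Simplification: the consensus is the most frequent exact description.
    best, count = Counter(usable).most_common(1)[0]
    score = 100.0 * count / len(usable)
    if score < AMBIGUOUS_THRESHOLD:
        return "AMBIGUOUS", score
    return best, score


# Made-up UniProt descriptions for three small clusters.
clear = consensus_annotation(
    ["Histone H2A", "Histone H2A", "Histone H2A type 1", ""]
)
mixed = consensus_annotation(["Kinase A", "Kinase B", "Kinase C"])
empty = consensus_annotation(["", "   "])
```

In this sketch `clear` keeps the description 'Histone H2A' with a score of about 66.7, `mixed` falls below the 40% threshold and becomes 'AMBIGUOUS', and `empty` becomes 'UNKNOWN'.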

Multiple Alignments

Ensembl provides pre-calculated multiple sequence alignments of all members of each cluster. A Wasabi viewer in the browser shows the alignment either for the Ensembl proteins alone or for the Ensembl and UniProt proteins together. You can also export the alignment of all family members as a text file; a wide range of formats is available from the control panel.

Alternatively, export alignments using the Compara Perl API.
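
Once an alignment has been exported as text, the "Ensembl proteins only" view can be approximated offline by keeping a subset of rows and dropping the columns that become gap-only. The sketch below assumes an aligned FASTA export with '-' as the gap character. The function names, the sequence identifiers and the 'ENS' prefix test are illustrative assumptions; not every Ensembl stable ID starts with that prefix.

```python
from typing import Callable, Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse aligned FASTA text into {identifier: aligned sequence}."""
    records: Dict[str, str] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the identifier.
            name = (line[1:].split() or [""])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def subset_alignment(
    alignment: Dict[str, str], keep: Callable[[str], bool]
) -> Dict[str, str]:
    """Keep selected rows, then remove columns that contain only gaps."""
    rows = {name: seq for name, seq in alignment.items() if keep(name)}
    if not rows:
        return {}
    length = len(next(iter(rows.values())))
    if any(len(seq) != length for seq in rows.values()):
        raise ValueError("aligned sequences must all have the same length")
    columns = [
        i for i in range(length) if any(seq[i] != "-" for seq in rows.values())
    ]
    return {name: "".join(seq[i] for i in columns) for name, seq in rows.items()}


# Made-up three-member alignment: two Ensembl proteins, one UniProt entry.
exported = """>ENSP00000000001
MK-LV--A
>ENSP00000000002
MK-LVT-A
>P12345
MKQLV-SA
"""
alignment = parse_fasta(exported)
# Assumption for illustration only: Ensembl IDs start with 'ENS'.
ensembl_only = subset_alignment(alignment, lambda name: name.startswith("ENS"))
```

Removing gap-only columns matters because columns that were opened solely to accommodate the UniProt members carry no information once those rows are dropped.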