The Ensembl Annotation Process

Genome assemblies

The Genome Assemblies page gives more information on where we get our genome assemblies from, how the sequence data for these genome assemblies are structured, and how we represent these data in Ensembl.

Protein-coding gene annotation

Protein-coding genes are automatically annotated using Ensembl's genebuild pipeline. All transcripts are based on mRNA and proteins in public scientific databases.

The human gene set is used as the GENCODE gene set. The human and mouse gene sets include all CCDS transcripts.

See the annotation article for more about the Ensembl genebuild pipeline, gene names and annotation.

Low-coverage genomes are annotated using a modified pipeline that attempts to locate genes split across multiple scaffolds.

More genes

The Ensembl gene set also includes automatically annotated pseudogenes and non-coding RNAs. For human and mouse, we include annotation from IMGT for immunoglobulin (Ig) genes.
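Because the gene set mixes protein-coding genes, pseudogenes and non-coding RNAs, it can be useful to tally genes by biotype when working with a downloaded annotation file. The following is a minimal sketch, assuming a GTF file in which each gene record carries a `gene_biotype` attribute; the sample records, gene IDs and coordinates are invented for illustration, and the function names are hypothetical rather than part of any Ensembl tool.

```python
import re
from collections import Counter

# Hypothetical GTF records (9 tab-separated columns). IDs and coordinates are made up.
SAMPLE_GTF = "\n".join([
    "#!example header line",
    '1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "GENE0001"; gene_biotype "protein_coding";',
    '1\tensembl\ttranscript\t100\t900\t.\t+\t.\tgene_id "GENE0001"; transcript_id "TRANS0001";',
    '1\tensembl\tgene\t1500\t2100\t.\t-\t.\tgene_id "GENE0002"; gene_biotype "processed_pseudogene";',
    '2\tensembl\tgene\t300\t380\t.\t+\t.\tgene_id "GENE0003"; gene_biotype "miRNA";',
    '2\tensembl\tgene\t5000\t9000\t.\t-\t.\tgene_id "GENE0004"; gene_biotype "protein_coding";',
])

# Matches attribute pairs of the form: key "value"
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    return dict(ATTRIBUTE_PATTERN.findall(field))


def count_gene_biotypes(gtf_text):
    """Count gene-level records per biotype, skipping comments and other features."""
    counts = Counter()
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9 or columns[2] != "gene":
            continue
        attributes = parse_attributes(columns[8])
        counts[attributes.get("gene_biotype", "unknown")] += 1
    return counts


biotype_counts = count_gene_biotypes(SAMPLE_GTF)
for biotype, n in sorted(biotype_counts.items()):
    print(biotype, n)
```

Only rows whose feature column is `gene` are counted, so transcript and exon records for the same gene do not inflate the totals.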

EST-based genes are predicted and displayed on the website but are not included in the final gene set.

Paired-end Illumina RNA-seq data have been used to generate transcript models for many species including human, zebrafish and pig.