When is new data (e.g. a new dbSNP build, a genome assembly, or updated ENCODE data) released on Ensembl?
At Ensembl, maintaining high standards is a priority; we want people to be able to trust the data on our website and in the underlying databases. For this reason, the process of annotation takes some time, varying according to the type and quality of the data. We carry out stringent quality control both on the raw input data that we receive and on the data that we output, with further checks throughout the release cycle. Because of this, our annotation takes longer than it would with fewer checks, and there may be a delay in the release of data. We believe it is more important to produce high-quality data slowly than lower-quality data quickly.
When the Ensembl Genebuild team receives a new genome assembly (which must be submitted to the INSDC and have passed their QC process), it takes two to three months to produce a set of gene models using the Ensembl gene annotation system. For all of our 65+ represented species, the Ensembl genebuild uses proteins from UniProt, and in recent genebuilds we have restricted this to those that have direct evidence (PE 1 and 2; http://www.uniprot.org/manual/protein_existence). Where annotated protein-cDNA pairs are available, we use Exonerate's cdna2genome module to produce protein coding models with UTR. For all species, same-species proteins are prioritised as an evidence source for gene annotation. For species where there are few proteins or cDNAs available in the public databases, RNAseq data are useful for gene annotation.
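As an illustration of the protein existence restriction described above, here is a minimal Python sketch that keeps only UniProtKB flat-file entries with PE level 1 or 2. It is not the Ensembl genebuild code; the function names, the sample entries and their accessions (X00001 and so on) are hypothetical, and it assumes the standard flat-file layout in which each entry carries a `PE` line and ends with `//`.

```python
import re

# Matches the protein-existence line of a UniProtKB flat-file entry,
# e.g. "PE   1: Evidence at protein level;"
PE_LINE = re.compile(r"^PE\s+(\d):")


def split_entries(flat_text):
    """Yield each entry (terminated by a '//' line) as a list of lines."""
    entry = []
    for line in flat_text.splitlines():
        if line.strip() == "//":
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:
        yield entry


def protein_existence(entry_lines):
    """Return the PE level (1-5) of an entry, or None if no PE line is found."""
    for line in entry_lines:
        match = PE_LINE.match(line)
        if match:
            return int(match.group(1))
    return None


def accession(entry_lines):
    """Return the primary (first) accession from the AC line, or None."""
    for line in entry_lines:
        if line.startswith("AC"):
            return line[2:].strip().split(";")[0].strip()
    return None


def filter_by_evidence(flat_text, allowed=(1, 2)):
    """List primary accessions of entries whose PE level is in `allowed`."""
    return [
        accession(entry)
        for entry in split_entries(flat_text)
        if protein_existence(entry) in allowed
    ]


# Hypothetical, heavily abbreviated entries for demonstration only.
SAMPLE = """\
ID   EXAMPLE1_HUMAN          Reviewed;         100 AA.
AC   X00001;
PE   1: Evidence at protein level;
//
ID   EXAMPLE2_HUMAN          Unreviewed;       120 AA.
AC   X00002; X00099;
PE   4: Predicted;
//
ID   EXAMPLE3_HUMAN          Reviewed;          90 AA.
AC   X00003;
PE   2: Evidence at transcript level;
//
"""

print(filter_by_evidence(SAMPLE))
```

With the sample above, only the first and third entries pass the filter, since the second is annotated as PE 4 (predicted).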
For our most used species (human and mouse), Havana manual curation and CCDS consensus sequences are merged into the Ensembl gene set, and given Ensembl (ENST or ENSMUST) IDs, resulting in our highest confidence transcripts (Havana/Ensembl merged transcripts are gold in our browser). Havana manual curation is also available in the Ensembl gene set for zebrafish, rat and pig. There is more information on our genebuild process on our genome annotation page, as well as species-specific genebuild information, accessible from some of the species homepages, such as human.
Imports from dbSNP undergo rigorous quality checking by the Ensembl Variation team. Suspect data are flagged and made available to users as failed variants. Variants are then classified according to their position on the genome and their consequences on the associated genes (consequence types). For human variants, linkage disequilibrium is calculated by population. This process can take a couple of weeks. Read more on our variation data description page.
Ensembl Regulation and ENCODE
When the Ensembl Regulation team receives data (e.g. from ENCODE or Roadmap Epigenomics), it is integrated and summarised in the regulatory build. This adds value for each regulatory element by providing an aggregate MultiCell summary, transcription factor binding motifs and more specific classifications across each of the supported cell types. This is presented alongside the underlying data and other complementary datasets, such as ENCODE segmentation, providing an easily accessible interface that summarises the regulatory status across the genome.
The majority of comparative genomics data is created in-house and is updated with every release. This unique data does not depend on imports from other databases, and is subject to internal quality checks. Well-established algorithms, such as Pecan for whole-genome alignments and TreeBeST for phylogenetic inference, are used in our pipelines - read more here.
New Ensembl releases come out every three months and can include updated datasets, such as dbSNP imports and new genebuilds. For more about the release cycle, coordinated by the Ensembl Production Team, read this article. Despite all our efforts, some erroneous data does slip through the net, in part reflecting issues with the underlying data. We are grateful to any users who spot errors and report them to firstname.lastname@example.org.
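To check programmatically which release a server is currently on, one option is the Ensembl REST API. The sketch below is an assumption-laden example rather than an official recipe: it assumes the `info/data` endpoint at rest.ensembl.org returns JSON containing a `releases` list, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint: reports the data release(s) served by the REST server.
REST_URL = "https://rest.ensembl.org/info/data?content-type=application/json"


def latest_release(payload):
    """Return the highest release number in the payload, or None if absent."""
    releases = payload.get("releases", [])
    return max(releases) if releases else None


def fetch_release_info(url=REST_URL, timeout=10):
    """Fetch and decode the JSON document describing available releases."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `latest_release(fetch_release_info())` requires network access and should return the current release number as an integer, which can be compared against the release in which a given dataset was announced.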