News for Ensembl Release 81 (July 2015)

News categories

New web displays and tools

Trackhub settings support (all species)

In this release we have added support for trackhub visibility settings; in other words, tracks that are turned on by default in the hub's trackDb.txt file should automatically be shown in Ensembl.

The only exception is for hubs that we configure internally, such as the Genome Reference Consortium GRIT hub. For these, all tracks will be hidden by default, but you can find them by searching in the Control Panel.
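
As a rough illustration of the rule above, the following Python sketch reads trackDb-style stanzas and lists the tracks that would be shown by default. It is not the Ensembl implementation: the function names and sample track names are hypothetical, composite tracks and subtracks are ignored, and a track with no visibility line is assumed to default to hide.

```python
def parse_trackdb(text):
    """Split trackDb.txt-style content into stanzas of key/value settings."""
    tracks = []
    current = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the current stanza.
            if current:
                tracks.append(current)
                current = {}
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        parts = line.split(None, 1)
        current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    if current:
        tracks.append(current)
    return tracks


def default_on(tracks):
    """Names of tracks whose visibility is anything other than 'hide'.

    Assumption: a stanza without a visibility setting is treated as hidden.
    """
    return [t["track"] for t in tracks
            if "track" in t and t.get("visibility", "hide") != "hide"]


# Hypothetical hub content, for illustration only.
SAMPLE_TRACKDB = """\
track genes_demo
shortLabel Genes
visibility pack

track coverage_demo
shortLabel Coverage
visibility hide

track reads_demo
shortLabel Reads
"""

print(default_on(parse_trackdb(SAMPLE_TRACKDB)))
```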

Transcript sequence markup (all species)

Transcript sequences can now be marked up to show exons as alternating upper and lower case characters, rather than grey/blue text. Simply check the "Show exons as alternating upper/lower case" box in the "Configure this page" panel on Transcript cDNA or Transcript Protein pages.

This markup option will also carry over to the sequence export if RTF format is chosen.
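
The alternating-case markup can be pictured with a short Python sketch. This is only an illustration of the idea, not the website code: the function name is hypothetical, and starting with upper case for the first exon is an assumption.

```python
def alternate_exon_case(exon_seqs):
    """Join exon sequences, switching between upper and lower case per exon.

    Assumption: the first exon is upper case, the second lower case, and so on.
    """
    marked = []
    for index, seq in enumerate(exon_seqs):
        marked.append(seq.upper() if index % 2 == 0 else seq.lower())
    return "".join(marked)


# Three made-up exon sequences.
print(alternate_exon_case(["atgg", "CCTA", "ggt"]))
```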

New species, assemblies and genebuilds

Update to Ensembl-Havana human GENCODE gene set (release 23) (Human)

Updated Ensembl-Havana gene set (GENCODE release 23). This gene set is a merge of complete Ensembl gene models and the latest Havana gene annotation. All CCDS genes are included in this gene set.

The human GRCh38.p3 gene annotation is also included:

The patches for GRCh38.p3 were annotated using a combination of manual annotation, annotation projected from the primary assembly and annotation derived from cDNA and protein alignment evidence. Annotation of the patches is stored in the core database.

Mouse: assembly updated to GRCm38.p4 (Mouse)

The mouse genome assembly was updated to GRCm38.p4 and the assembly information in all mouse databases has been altered accordingly. This minor assembly update contains 30 assembly patches. The DNA sequence for the primary assembly (chromosomes, unlocalized scaffolds and unplaced scaffolds) remains unchanged.

Vega Mouse annotation updated (Mouse)

Manual annotation of mouse from Havana has been updated and contains the data released in Vega 61

Vega Human annotation updated (Human)

Manual annotation of human from Havana has been updated and contains the data released in Vega 61

New variation data

New dbSNP import for cow (Cow)

dbSNP143 data has been imported for cow

New regulation data

Mouse regulatory Build update (Mouse)

The Regulatory Build on Mouse was re-computed, converting the "old style" build to the "new style" build, as was done on human in Ensembl 76. All Regulatory Builds in Ensembl are now updated to the new style.

We took the opportunity to increase the number of cell types to 8.

API and schema changes

Schema update: Storing sample data (all species)

We updated how we store populations, individuals and samples. With the updated schema we can store multiple samples for an individual. All genotype and read coverage data are now stored at the sample level.

New table:

  • sample

Renamed tables:

  • individual_population to sample_population
  • individual_genotype_multiple_bp to sample_genotype_multiple_bp

The change is reflected in tables that previously stored individual data. The individual_id column has been renamed to sample_id in the following tables:

  • compressed_genotype_region
  • read_coverage
  • structural_variation_sample

Updated columns in the individual table: the display, has_coverage and variation_set_id columns have been moved into the new sample table and deleted from the individual table.
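
For anyone maintaining their own SQL against the variation schema, the table renames listed above can be captured in a small Python helper. This is a minimal sketch under stated assumptions: the helper names are hypothetical, it only rewrites the two renamed table names, and it deliberately leaves column references such as individual_id alone, because that column still exists in other tables.

```python
import re

# Table renames listed in the release notes above.
RENAMED_TABLES = {
    "individual_population": "sample_population",
    "individual_genotype_multiple_bp": "sample_genotype_multiple_bp",
}

# Tables in which individual_id became sample_id (needs manual review).
TABLES_USING_SAMPLE_ID = {
    "compressed_genotype_region",
    "read_coverage",
    "structural_variation_sample",
}


def update_table_names(sql):
    """Replace pre-81 table names in a SQL string with their new names."""
    for old, new in RENAMED_TABLES.items():
        sql = re.sub(rf"\b{re.escape(old)}\b", new, sql)
    return sql


def needs_column_review(sql):
    """Tables mentioned in the SQL where individual_id is now sample_id."""
    return sorted(t for t in TABLES_USING_SAMPLE_ID
                  if re.search(rf"\b{re.escape(t)}\b", sql))


print(update_table_names("SELECT * FROM individual_population"))
```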

API support for the new sample schema (all species)

We updated our API to work with the new sample schema.

Added new modules for representing, creating and storing sample objects:

  • Sample.pm and SampleAdaptor.pm

Renamed modules:

  • IndividualGenotype.pm to SampleGenotype.pm
  • IndividualGenotypeFeature.pm to SampleGenotypeFeature.pm
  • IndividualGenotypeAdaptor.pm to SampleGenotypeAdaptor.pm
  • IndividualGenotypeFeatureAdaptor.pm to SampleGenotypeFeatureAdaptor.pm

Updated variable names from individual to sample in almost all modules in the variation API. 

Updated scripts and pipelines.

Our test suite has been updated accordingly.

Other updates

Compara

Schema version update (all species)

80 -> 81

Pairwise alignments: human and mouse patches (all species)

  • Human is patched to GRCh38.p3, so

    • we will upload human_ref-to-human_patches alignments done by Genebuilders 
    • we will run human_patches-to-high_coverage_species lastz alignments

  • Mouse is patched to GRCm38.p4, so

    • we will upload mouse_ref-to-mouse_patches alignments done by Genebuilders 
    • we will run mouse_patches-to-high_coverage_species lastz alignments

ProteinTrees and homologies (all species)

GeneTrees (protein-coding) with new/updated genebuilds and assemblies

 -- all-vs-all blastp (ncbi-blast-2.2.30+)
 -- Clustering using hcluster_sg
 -- Multiple sequence alignments using MCoffee (Version_9.03.r1318) or Mafft (mafft-7.221)
 -- Phylogenetic reconstruction using TreeBeST
 -- Homology inference
 -- Pairwise gene-based dN/dS scores for high coverage species pairs only (both on orthologues and paralogues) (codeml/PAML v4.3)
 -- GeneTree stable ID mapping
 -- Per family gene dynamics using CAFE (v2.2)

Protein Families (all species)

Updated MCL families including all Ensembl transcript isoforms (including human non-reference haplotypes) and newest Uniprot Metazoa.

 -- Getting distances by NCBI BlastP (v.2.2.30+)
 -- Clustering by MCL (v.14-137)
 -- Multiple Sequence Alignments with MAFFT (v.7.221)
 -- Family stable ID mapping

ncRNAtrees and homologies (all species)

 -- Classification based on Rfam models (v11.0)
 -- Multiple sequence alignments with Infernal
 -- Phylogenetic reconstruction using RAxML
 -- Phylogenetic reconstruction using FastTree2 and RAxML-Light for very big families
 -- Additional multiple sequence alignments with Prank (w/ genomic flanks)
 -- Additional phylogenetic reconstruction using PhyML and NJ
 -- Phylogenetic tree merging using TreeBeST
 -- Per family gene dynamics using CAFE
 -- Homology inference
 -- Secondary structure plots

Compara dumps (all species)

  • EMF / Fasta / OrthoXML / PhyloXML dumps for ProteinTrees + PhyloXML dumps for CAFE ProteinTrees
  • EMF / Fasta / OrthoXML / PhyloXML dumps for ncRNAtrees + PhyloXML dumps for CAFE ncRNAtrees

Core

Ensembl VM Build (all species)

The Ensembl Virtual Machine appliance will be updated to version 81.

External database references update (multiple species)

Xrefs update for:

homo_sapiens (human), mus_musculus (mouse), otolemur_garnetii (mouse lemur), sorex_araneus (shrew), oryzias_latipes (medaka), pan_troglodytes (chimp), petromyzon_marinus (lamprey), taeniopygia_guttata (zebrafinch), ovis_aries (sheep), canis_familiaris (dog), ciona_intestinalis (sea squirt), anolis_carolinensis (anole lizard), chlorocebus_sabaeus (vervet monkey), ornithorhynchus_anatinus (platypus)

LRG Import (all species)

Import of the latest version of the Locus Reference Genomic (LRG) dataset.

patch_80_81_a.sql - schema_version update (all species)

Update schema_version in meta table to 81.

patch_80_81_a.sql - schema_version update in ontology db (all species)

Update schema_version in meta table to 81.

patch_80_81a.sql - schema_version update in production db (all species)

Update schema_version in production database to 81.

Stable ID lookup (all species)

Stable ID lookup provided for REST services

Includes lookup for RefSeq and CCDS entries

UTR features in the REST API (all species)

UTR features can be retrieved using the REST API

Band information in the REST API (all species)

Band information can be retrieved via the overlap endpoint in the REST API
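
The three REST additions above (stable ID lookup, UTR features, band information) can be sketched as request URLs. The Python below only builds the URLs and makes no network calls. The server address, the `expand` and `utr` parameters on the lookup endpoint, the example stable ID and the example region are assumptions drawn from the public REST documentation rather than from these release notes, so check them against the current REST documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumption: address of the public Ensembl REST server.
REST_SERVER = "https://rest.ensembl.org"


def lookup_url(stable_id, with_utr=False):
    """URL for looking up a stable ID, optionally asking for UTR features.

    Assumption: UTRs are requested with expand=1 and utr=1.
    """
    params = {"content-type": "application/json"}
    if with_utr:
        params["expand"] = "1"
        params["utr"] = "1"
    return f"{REST_SERVER}/lookup/id/{stable_id}?{urlencode(params)}"


def band_overlap_url(species, region):
    """URL asking the overlap endpoint for karyotype bands in a region."""
    query = urlencode({"feature": "band", "content-type": "application/json"})
    return f"{REST_SERVER}/overlap/region/{species}/{region}?{query}"


# Illustrative identifiers only.
print(lookup_url("ENSG00000157764"))
print(lookup_url("ENSG00000157764", with_utr=True))
print(band_overlap_url("human", "7:140424943-140624564"))
```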

New UTR, CDS and ExonTranscript features (all species)

The Ensembl API supports the retrieval of UTR, CDS and ExonTranscript features

UTR features represent the non-coding exons of a transcript; CDS features represent the coding exons of a transcript.

ExonTranscript features are Exons which retain the link to their parent transcript as well as their rank
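
To make the relationship between these feature types concrete, here is a minimal Python sketch that splits a transcript's exons into UTR and CDS segments while keeping each exon's rank. It is not the Ensembl API: the function name, the tuple layout and the forward-strand, inclusive-coordinate convention are assumptions made for illustration.

```python
def split_exons(exons, cds_start, cds_end):
    """Split exons into UTR and CDS segments, keeping the exon rank.

    exons: list of (start, end) tuples, forward strand, sorted, inclusive.
    cds_start, cds_end: genomic coordinates of the coding region, inclusive.
    Returns (utr_segments, cds_segments) as lists of (rank, start, end).
    """
    utr, cds = [], []
    for rank, (start, end) in enumerate(exons, start=1):
        if start < cds_start:
            # Part (or all) of the exon lies upstream of the coding region.
            utr.append((rank, start, min(end, cds_start - 1)))
        if end >= cds_start and start <= cds_end:
            # The exon overlaps the coding region.
            cds.append((rank, max(start, cds_start), min(end, cds_end)))
        if end > cds_end:
            # Part (or all) of the exon lies downstream of the coding region.
            utr.append((rank, max(start, cds_end + 1), end))
    return utr, cds


# Made-up coordinates: three exons, coding region from 150 to 550.
utr_parts, cds_parts = split_exons([(100, 200), (300, 400), (500, 600)], 150, 550)
print(utr_parts)
print(cds_parts)
```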

GFF3 dumps (all species)

Ensembl gene annotation will be provided in GFF3 files, along with the already existing GTF files
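
For readers new to the format, a GFF3 record is a tab-separated line with nine columns, the last of which holds `key=value` attributes separated by semicolons. The Python sketch below parses one such line. The function name and the sample record are hypothetical and are not taken from an Ensembl dump.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict, or return None for non-features."""
    if not line.strip() or line.startswith("#"):
        return None  # blank lines, comments and directives carry no feature
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[unquote(key)] = unquote(value)
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),
        "score": score, "strand": strand, "phase": phase,
        "attributes": attributes,
    }


# Hypothetical record, for illustration only.
EXAMPLE = "1\texample\tgene\t1000\t5000\t.\t+\t.\tID=gene:EXAMPLE0001;Name=demo%3Bgene"
print(parse_gff3_line(EXAMPLE))
```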

Regulation

patch_80_81_c Drop experiment.date (all species)

The experiment.date field has been dropped from the funcgen schema and the related Bio::EnsEMBL::Funcgen::Experiment::date method has been deprecated.

Collection Files replaced by BigWigs (all species)

The Funcgen density maps, which were previously stored in an in-house flat file format (Collection Files), were ported to the more common standard, BigWig.

The Funcgen API now provides only a path to the files, which can be read using the Ensembl::IO repository. 

patch_80_81_a.sql - schema_version update (all species)

Updates the schema_version for release 81.

patch_80_81_b.sql - add gender 'mixed' to table cell_type (all species)

Adds the gender value 'mixed' to the cell_type table.

Genebuild

Remove Amazon molly empty transcripts (Amazon molly)

There are some empty transcripts for Amazon molly that need to be removed.

Amazon molly: Fix Genscan predictions (Amazon molly)

There are some problems with Amazon molly Genscan predictions that prevent the annotation from being dumped. These need to be fixed.

Human: GRCh38.p3 Karyotype Bands (Human)

Karyotype bands were updated in regions overlapping patches

Updated human otherfeatures db: New CCDS import (Human)

This release of the human gene set also includes 31,359 transcript models as part of an updated version (June 2015) of CCDS

Zebrafish: import clone data (Zebrafish)

Zebrafish clones were imported from the NCBI clone database. The tracks for the clones can be found under "Clones and misc regions" in the configuration menu, while the coordinates for the BAC ends can be found as tracks under "Simple features", also in the configuration menu.

Mouse: GRCm38.p4 Karyotype Bands (Mouse)

Karyotype bands were updated in regions overlapping patches

Update to Ensembl-Havana mouse GENCODE gene set (Mouse)

Updated Ensembl-Havana mouse gene set. This gene set is a merge of complete Ensembl gene models and the latest Havana gene annotation. All CCDS genes are included in this gene set.

The mouse GRCm38.p4 gene annotation is also included:

The patches for GRCm38.p4 were annotated using a combination of annotation projected from the primary assembly and annotation derived from cDNA and protein alignment evidence. Annotation of the patches is stored in the core database.

Mouse: updated cDNA alignments (Mouse)

A new cdna database was created for e80: The latest set of cDNAs for mouse (as of Month 2015) from the European Nucleotide Archive and NCBI RefSeq (release nn) were aligned to the current genome using Exonerate.

Updated mouse otherfeatures db: New CCDS import (Mouse)

This release of the mouse gene set also includes 23,830 transcript models as part of an updated version (May 2015) of CCDS

Correction of DMD transcript in Dog (Dog)

A transcript in the DMD gene is missing an exon; we will fix the transcript.

Human: updated RefSeq gene import (Human)

The imported RefSeq gene set was updated in the human otherfeatures database. Please note that RefSeq annotates gene models on cDNA sequence and not on the reference genome, so when users choose to translate the RefSeq transcripts off the reference genome, the translations may contain stop codons.

Mouse: updated RefSeq gene import (Mouse)

The imported RefSeq gene set was updated in the mouse otherfeatures database. Please note that RefSeq annotates gene models on cDNA sequence and not on the reference genome, so when users choose to translate the RefSeq transcripts off the reference genome, the translations may contain stop codons.

Human: RefSeq-to-Ensembl model comparison (Human)

For each refseq_import transcript model present in the human otherfeatures db, a comparison is carried out with all overlapping Ensembl transcript models from the core db.

Initially the models are compared on the whole transcript level: all exons are compared in terms of genomic coordinates, and the transcript sequences of the two models are also compared.

For non-coding models, if both of these comparisons match then the models are considered to match on the whole transcript level and the RefSeq model is given an attribute to say there is a match on the whole transcript level. If no overlapping Ensembl model meets the criteria the RefSeq model is given a transcript attribute to denote this.

For models where a CDS is defined there is an extra level of comparison. The coding exon coordinates, CDS and translation sequences of both models are also compared.

If all exon coordinates (coding and non-coding) and the transcript, CDS and translation sequences match, then the RefSeq model is given an attribute to say there is a match on the whole transcript level.

Failing this, a comparison is done on the coding exon coordinates, CDS and translation sequences only. If the comparisons now match, the RefSeq transcript is given an attribute to denote that there is a match on the CDS level only.

If there are still no matching Ensembl transcripts at this point the RefSeq transcript is given an attribute to denote that there is no matching Ensembl model.

All matching Ensembl models have their stable IDs listed in the value field of the corresponding transcript attribute for the RefSeq model.
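
The comparison steps described above can be summarised in a short Python sketch. It follows the order given in the text (whole transcript first, then CDS only, then no match) but is not the Genebuild code: the dictionary keys, the result labels and the function name are hypothetical.

```python
def classify_refseq_model(refseq, ensembl_models):
    """Classify a RefSeq model against overlapping Ensembl models.

    Each model is a dict with 'stable_id', 'exons' (list of (start, end)) and
    'seq'; coding models also carry 'coding_exons', 'cds_seq' and 'translation'.
    Returns (label, list_of_matching_ensembl_stable_ids).
    """
    is_coding = refseq.get("cds_seq") is not None
    whole, cds_only = [], []
    for model in ensembl_models:
        same_transcript = (refseq["exons"] == model["exons"]
                           and refseq["seq"] == model["seq"])
        same_cds = is_coding and all(
            refseq.get(key) == model.get(key)
            for key in ("coding_exons", "cds_seq", "translation"))
        if same_transcript and (same_cds or not is_coding):
            whole.append(model["stable_id"])
        elif same_cds:
            cds_only.append(model["stable_id"])
    if whole:
        return "whole_transcript_match", whole
    if cds_only:
        # Only reported when no model matches on the whole transcript level.
        return "cds_only_match", cds_only
    return "no_match", []


# Made-up models: the Ensembl model has a longer 5' UTR but the same CDS.
refseq_model = {
    "exons": [(100, 200), (300, 400)], "seq": "ACGTACGT",
    "coding_exons": [(150, 200), (300, 350)], "cds_seq": "ATGTAA",
    "translation": "M",
}
ensembl_model = dict(refseq_model, stable_id="ENST_EXAMPLE_1",
                     exons=[(90, 200), (300, 400)], seq="GGACGTACGT")
print(classify_refseq_model(refseq_model, [ensembl_model]))
```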

Human: transcript attributes for RefSeq-genomic-to-mRNA comparison (Human)

Transcript attributes will be added for the refseq_import geneset in the human otherfeatures db. Each refseq_import transcript will have an attribute to denote whether the genomic sequence that the transcript covers matches the mRNA sequence that the transcript is based on (the sequences present in the RefSeq mRNA file).

A perfect match is defined as an alignment across the entirety of both sequences that contains no mismatches or indels. If initially there is a mismatch, the RefSeq mRNA will go through polyA clipping and the sequences will be compared again to see if a perfect match is possible after polyA clipping.

Transcripts that do not have a perfect match between the mRNA and the genomic sequence will get additional attributes to define what regions (5' UTR, CDS, 3' UTR, or 'whole transcript' if there is no CDS defined) do not align perfectly, along with a summary of the information in the alignment (match, mismatch, indel count, total indel length).
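
The two-step check described above (compare, then clip the polyA tail and compare again) might look like the following Python sketch. It simplifies heavily: a perfect match is modelled as plain string equality, polyA clipping as removing a run of trailing A bases that is absent from the genomic sequence, and the function name and result labels are hypothetical. The real pipeline works on alignments and may clip differently.

```python
def compare_genomic_to_mrna(genomic_seq, mrna_seq):
    """Report whether the mRNA matches the genomic transcript sequence."""
    genomic = genomic_seq.upper()
    mrna = mrna_seq.upper()
    if genomic == mrna:
        return "perfect_match"
    # Assumption: after clipping, the mRNA must equal the genomic sequence,
    # so the removed tail is whatever follows it and must consist only of A.
    tail = mrna[len(genomic):]
    if mrna.startswith(genomic) and tail and set(tail) == {"A"}:
        return "perfect_match_after_polyA_clipping"
    return "no_perfect_match"


print(compare_genomic_to_mrna("ACGTA", "ACGTAAAAA"))
```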

Mouse clone import (Mouse)

Mouse clone libraries have been imported from the NCBI clone database to replace previous DAS tracks. The tracks for the clones can be found under "Clones and misc regions" in the configuration menu, while the coordinates for the BAC ends can be found as tracks under "Simple features", also in the configuration menu.

Human: updated cDNA alignments (Human)

A new cdna database was created for e80: The latest set of cDNAs for human (as of Month 2015) from the European Nucleotide Archive and NCBI RefSeq (release nn) were aligned to the current genome using Exonerate.

Upgrade remaining species to rnaseq matrix (all species)

For some species we have RNASeq data but have not yet displayed options in an RNASeq matrix for the users. This requires changes to the analysis_description, analysis_web_data and web_data tables in the ensembl_production database

Human: assembly updated to GRCh38.p3 (Human)

The human genome assembly was updated to GRCh38.p3 and the assembly information in all human databases has been altered accordingly. The DNA sequence for the primary assembly (chromosomes, unlocalized scaffolds and unplaced scaffolds) remains unchanged.

Production

EMBL and Genbank Dumps (all species)

EMBL and Genbank dumps for all species.

Ensembl 81 mart databases (all species)

  • Ensembl Genes 81
    • Human assembly updated from GRCh38.p2 to GRCh38.p3
    • Mouse assembly updated from GRCm38.p3 to GRCm38.p4
  • Ensembl Variation 81
    • Human assembly updated from GRCh38.p2 to GRCh38.p3
    • Mouse assembly updated from GRCm38.p3 to GRCm38.p4
    • Added new structural variation species Sheep (Ovis aries)
  • Ensembl Regulation 81
    • New mouse regulation build data
  • Vega 61
    • Human assembly updated from GRCh38.p2 to GRCh38.p3
    • Mouse assembly updated from GRCm38.p3 to GRCm38.p4

External reference projection (all species)

Gene ontology (GO) identifiers and gene name projection to all species.

FASTA & GTF dumps (all species)

FASTA & GTF dumps for all the species

Variation

Phenotype data updates (multiple species)

  • Human phenotype data has been updated from different sources including NHGRI-EBI GWAS, OMIM, ClinVar, UniProt and Decipher.
  • OMIA data for Cow, Dog, Horse, Sheep
  • RGD data for Rat
  • AnimalQTL for Cow, Horse, Chicken, Pig, Sheep
  • IMPC data (release 3.1) for Mouse

Structural variations (Zebrafish, Human, Mouse, Sheep)

  • Added new studies and updated other studies from DGVa

Web

Replace "i" icon with "?" icon (all species)

The "i" icon, which opens the help/documentation page when clicked, has been replaced with a new "?" icon.

The "i" icon in the tracks configuration has been left unchanged, as it provides further information rather than help.

Public plugins sqlite and sge_blast removed (all species)

These two plugins, one providing SQLite support for the user database and the other for SGE BLAST, were outdated and have been removed from the public-plugins repository.

Retirement of archives 67 and 59 (all species)

This release cycle we will be retiring archive 67 (May 2012) in accordance with our three-year rolling retirement policy. Due to the arrival of GRCz10 in Ensembl 80 we will also be retiring archive 59 (Aug 2010), which currently shows Zebrafish Zv8 annotation. The data will remain available on our public database server; only the web interfaces will be removed.

User accounts/session database configuration changed (all species)

The database used for user accounts and session records was previously configured in conf/SiteDefs.pm as below:

$SiteDefs::ENSEMBL_USERDB_TYPE = 'mysql';
$SiteDefs::ENSEMBL_USERDB_NAME = 'ensembl_accounts';
$SiteDefs::ENSEMBL_USERDB_USER = 'mysqluser';
$SiteDefs::ENSEMBL_USERDB_HOST = 'localhost';
$SiteDefs::ENSEMBL_USERDB_PORT = 3306;
$SiteDefs::ENSEMBL_USERDB_PASS = '';

These configurations have been removed and now the database is configured by adding configurations in conf/ini-file/MULTI.ini as below:

[databases]
DATABASE_ACCOUNTS = ensembl_accounts
DATABASE_SESSION = ensembl_accounts

[DATABASE_ACCOUNTS]
HOST = localhost
PORT = 3306
USER = mysqluser
PASS =

[DATABASE_SESSION]
HOST = localhost
PORT = 3306
USER = mysqluser
PASS =
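
To check that a hand-edited file has the expected shape, the INI layout above can be read with a few lines of Python. This is only a sanity-check sketch using the standard configparser module, not how the Ensembl web code loads its configuration. The helper name is hypothetical and the embedded text repeats the example values shown above.

```python
import configparser

# Same layout and placeholder values as the example above.
MULTI_INI = """\
[databases]
DATABASE_ACCOUNTS = ensembl_accounts
DATABASE_SESSION = ensembl_accounts

[DATABASE_ACCOUNTS]
HOST = localhost
PORT = 3306
USER = mysqluser
PASS =

[DATABASE_SESSION]
HOST = localhost
PORT = 3306
USER = mysqluser
PASS =
"""


def read_db_settings(ini_text, key):
    """Collect the database name and connection details for one key."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names in upper case
    parser.read_string(ini_text)
    section = parser[key]
    return {
        "name": parser["databases"][key],
        "host": section["HOST"],
        "port": section.getint("PORT"),
        "user": section["USER"],
        "password": section["PASS"],
    }


print(read_db_settings(MULTI_INI, "DATABASE_ACCOUNTS"))
```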

Future Plans

Read about our future plans on our blog!