News for Ensembl Release 88 (March 2017)

News categories

New web displays and tools

Masthead redesign (all species)

As we begin to incorporate strains and other non-reference assemblies into Ensembl, we want to display more information about a "species" on all relevant pages so that users are aware of the differences.

We have therefore rearranged the top of our browser pages, removing the species "tab" and putting the information in the dark blue area above the tabs where there is more space. The dropdown list of species is still accessible via the triangle icon, and all other functionality is unaffected.

Find a Data Display (all species)

Ensembl has many different pages where you can access data or see visualisations of it. To help you find the display you need, we have introduced our "Find a data display" page. Just type a gene identifier, location or rsID into the form to see a list of relevant views, with thumbnails of sample data and links to each view for your feature or region of choice.

The page can be accessed from the central panel on the home page or via the 'Help and Documentation' link in the page header; in the latter case, the link is near the top of the left-hand menu.

New species, assemblies and genebuilds

Update to Ensembl-Havana human GENCODE gene set (release 26) (Human)

Updated Ensembl-Havana gene set (GENCODE release 26). This gene set is a merge of complete Ensembl gene models and the latest Havana gene annotation. All CCDS genes are included in this gene set.

The human GRCh38.p10 gene annotation is also included:

The patches for GRCh38.p10 were annotated using a combination of manual annotation, annotation projected from the primary assembly and annotation derived from cDNA and protein alignment evidence. Annotation of the patches is stored in the core database.

Mouse: update to Ensembl-Havana GENCODE gene set (Mouse)

Updated Ensembl-Havana mouse gene set. This gene set is a merge of complete Ensembl gene models and the latest Havana gene annotation. All CCDS genes are included in this gene set.

Rat: gene set update (Rat)

The rat gene set will be updated to include the latest manual annotation from the Havana team.

Vega Mouse annotation updated (Mouse)

Manual annotation of mouse from Havana has been updated and contains the data released in Vega 68

Vega Human annotation updated (Human)

Manual annotation of human from Havana has been updated and contains the data released in Vega 68

Vega Rat annotation updated (Rat)

Manual annotation of rat from Havana has been updated and contains the data released in Vega 68

New variation data

New dbSNP data for Human (Human)

Human is updated with the latest version of dbSNP (149)

New dbSNP data for Rat (Rat)

Rat is updated with the latest version of dbSNP (149)

New dbSNP data for Platypus (Platypus)

Platypus is updated with the latest version of dbSNP (149)

New dbSNP data for Opossum (Opossum)

Opossum is updated with the latest version of dbSNP (149)

PolyPhen version update (Human)

The version of PolyPhen run in Ensembl will be updated to 2.2.2r405c

Other updates

Compara

Schema: new dnafrag.codon_table_id column (all species)

Indicates which codon table should be used to translate sequences of this dnafrag. The information is a copy of what is stored in the Core database, but it is needed to 1) make the dN/dS computation more efficient and 2) handle alternative codon tables in the absence of a core database.
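To illustrate why the codon table matters, here is a minimal sketch of translating the same sequence under two different tables. It assumes that codon_table_id follows the NCBI genetic-code numbering (1 = standard, 2 = vertebrate mitochondrial); the names `CODON_TABLES` and `translate` are hypothetical and are not part of the Compara API.

```python
from itertools import product

# Codons in NCBI order: TTT, TTC, TTA, TTG, TCT, ... (first base varies slowest).
BASES = "TCAG"
CODONS = ["".join(codon) for codon in product(BASES, repeat=3)]

# Amino acids per codon, keyed by NCBI genetic-code number.
# Assumption: codon_table_id uses this numbering.
AMINO_ACIDS = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",  # standard
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",  # vertebrate mitochondrial
}

# Hypothetical lookup: codon_table_id -> {codon: amino acid}.
CODON_TABLES = {
    table_id: dict(zip(CODONS, residues))
    for table_id, residues in AMINO_ACIDS.items()
}


def translate(sequence, codon_table_id=1):
    """Translate a DNA sequence codon by codon, ignoring a trailing partial codon."""
    table = CODON_TABLES[codon_table_id]
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(table[seq[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    dna = "ATGATATGAAGA"
    print(translate(dna, 1))  # standard table
    print(translate(dna, 2))  # vertebrate mitochondrial table
```

The same four codons give different peptides under the two tables, which is why a mitochondrial dnafrag translated with the standard table would appear to contain premature stops.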

Schema: new exon_boundaries table (all species)

Used to keep track of all the exon coordinates

Schema: new gene_member.biotype_group column (all species)

Indicates whether the gene is protein-coding, a short ncRNA, etc. This allows all the genes to be loaded in one operation, and lets the homology pipelines filter their datasets using the compara database only instead of querying the core database.

Schema: new genome_db.strain_name column (all species)

Used to indicate the name of this strain (complements taxon_id which only provides the species name)

Schema: new seq_member.has_translation_edits and seq_member.has_transcript_edits columns (all species)

Used to flag the seq_members that have hardcoded transcript or protein sequences. When this happens, the data (exon coordinates, transcript sequence and translation sequence) are not in sync and some analyses have to be discarded.

patch_87_88_a.sql - Schema version update (all species)

87 -> 88

H.sap alignments (all species)

We will top up all LastZ alignments for human against every target species that has a karyotype.

Family REST endpoints (all species)

Addition of family REST endpoints

Pruned EPO alignments (all species)

Method to retrieve EPO alignments for a given species subset.

Cafe Tree REST endpoints (all species)

Addition of Cafe Tree REST endpoints.

Protein Families (all species)

Updated HMM families including all Ensembl transcript isoforms (including human non-reference haplotypes) and the newest UniProt Metazoa.

 -- Clustering by PantherScore (based on Ensembl HMM library)
 -- Multiple Sequence Alignments with MAFFT (v.7.221)

ProteinTrees and homologies (all species)

GeneTrees (protein-coding) with new/updated genebuilds and assemblies

 -- All-vs-all blastp (ncbi-blast-2.2.30+)
 -- Clustering using hcluster_sg
 -- Multiple sequence alignments using MCoffee (Version_9.03.r1318) or Mafft (mafft-7.221)
 -- Phylogenetic reconstruction using TreeBeST
 -- Homology inference
 -- Pairwise gene-based dN/dS scores for high coverage species pairs only (both on orthologues and paralogues) (codeml/PAML v4.3)
 -- GeneTree stable ID mapping
 -- Per family gene dynamics using CAFE (v2.2)
 -- Computation of pairwise gene-order conservation score
 -- Comparison of orthologies with whole-genome alignments
 -- High-confidence calls

ncRNAtrees and homologies (all species)

  • Classification based on Rfam models (v12.1)
  • Multiple sequence alignments with Infernal
  • Phylogenetic reconstruction using RAxML
  • Phylogenetic reconstruction using FastTree2 and ExaML for very big families
  • Additional multiple sequence alignments with Prank (w/ genomic flanks)
  • Additional phylogenetic reconstruction using PhyML and NJ
  • Phylogenetic tree merging using TreeBeST
  • Per family gene dynamics using CAFE
  • Homology inference
  • Secondary structure plots

Core

External database references update (multiple species)

Xref updates for: alpaca, dolphin, fugu, gibbon, guinea pig, human, kangaroo rat, marmoset, mouse, orangutan, panda, platyfish, rat, sloth, tarsier, tasmanian devil and xenopus.

patch_87_88_b.sql (all species)

Name column in seq_region table expanded to varchar(255)

EnsemblGenomes

Latest versions of core databases for EG species (Caenorhabditis elegans, Fruitfly, Saccharomyces cerevisiae)

The core databases for C. elegans, D. melanogaster and S. cerevisiae have been updated by Ensembl Genomes to include more up-to-date cross-references, protein features etc., though the assembly and genebuilds have not changed. In addition, standard seq region synonyms have been added for the mitochondria for C. elegans and S. cerevisiae

Regulation

GTEx Update (Human)

Update to GTEx v6 used by the eQTL REST endpoint

Database schema changes (all species)

patch_87_88_b.sql - Allow seq_region name to be longer

patch_87_88_c.sql - sample_regulatory_feature_id field for regulatory build

Genebuild

Updated human otherfeatures db: New CCDS import (Human)

This release of the human gene set also includes 32,514 transcript models as part of an updated version (November 2016) of CCDS

Human: updated cDNA alignments (Human)

A new cdna database will be created for e88: The latest set of cDNAs for human (as of December 2016) from the European Nucleotide Archive and NCBI RefSeq will be aligned to the current genome using Exonerate.

Mouse: updated cDNA alignments (Mouse)

A new cdna database will be created for e88: The latest set of cDNAs for mouse (as of January 2017) from the European Nucleotide Archive and NCBI RefSeq will be aligned to the current genome using Exonerate.

Human and mouse: Ensembl-to-RefSeq comparison attributes (Human, Mouse)

For each Ensembl transcript present in the human and mouse core db, a comparison is carried out with all overlapping RefSeq transcripts from the otherfeatures db.

Up to five comparisons are carried out (depending on whether the models are non-coding or coding):

1) Check if all exon coordinates match (all transcripts)

2) Check if transcript sequences match (all transcripts)

3) Check if the CDS exon coordinates match (coding transcripts only)

4) Check if the CDS sequences match (coding transcripts only)

5) Check if the translation sequences match (coding transcripts only)

For non-coding models, if comparisons (1) and (2) both match, the Ensembl transcript is given an attribute to say there is a match on the whole transcript level.

For coding models, if all five comparisons are true, the Ensembl transcript is given an attribute to say there is a match on the whole transcript level. Failing that, if comparisons (3), (4) and (5) are true, the Ensembl transcript is given an attribute to say there is a match on the CDS level.

The stable ids of any matching RefSeq transcripts will be stored in the value field of the Ensembl transcript attribute.
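The decision logic described above can be sketched as follows. This is an illustration only, not the genebuild code: the class, field and function names are hypothetical, and it assumes that the fallback case for coding models (comparisons 3 to 5 true, but not 1 and 2) denotes a CDS-level match rather than a whole-transcript match.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comparison:
    """Results of comparing one Ensembl transcript with one RefSeq transcript."""
    exon_coords_match: bool                   # comparison (1), all transcripts
    transcript_seq_match: bool                # comparison (2), all transcripts
    cds_coords_match: Optional[bool] = None   # comparison (3), coding only
    cds_seq_match: Optional[bool] = None      # comparison (4), coding only
    translation_match: Optional[bool] = None  # comparison (5), coding only


def match_level(result: Comparison, coding: bool) -> Optional[str]:
    """Return 'whole_transcript', 'cds' or None (labels are hypothetical)."""
    whole = result.exon_coords_match and result.transcript_seq_match
    if not coding:
        return "whole_transcript" if whole else None
    cds = bool(
        result.cds_coords_match
        and result.cds_seq_match
        and result.translation_match
    )
    if whole and cds:
        return "whole_transcript"
    if cds:
        # Assumption: the fallback case is a CDS-level match.
        return "cds"
    return None


if __name__ == "__main__":
    same_cds_different_utr = Comparison(False, False, True, True, True)
    print(match_level(same_cds_different_utr, coding=True))
```

A transcript whose UTRs differ from RefSeq but whose coding region is identical would fall into the second category in this sketch.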

Updated mouse otherfeatures db: New CCDS import (Mouse)

The latest CCDS mouse set will be imported.

Mouse lemur: lincRNA models (Mouse Lemur)

Adding lincRNA models to core db

Platypus: Assembly Synonyms added (Platypus)

GenBank accessions will be added to the Platypus database as synonyms for the Ensembl sequence names.

Pika: assembly name change (Pika)

Owing to the fragmentary nature of the OchPri2.0 assembly, it was necessary to arrange some scaffolds into "gene-scaffold" super-structures in order to present complete genes. The assembly-name values for pika will be changed to 'OchPri2.0-Ens' so that the name reflects the fact that the genome as presented by Ensembl differs from OchPri-2.0.

Production

Ensembl 88 mart databases (all species)

  • Ensembl Genes 88
    • Region filter performance improvement
    • Renamed the internal and display names of some filters and attributes
    • Renamed attribute "% GC content" to "Gene % GC content"
  • Mouse Genes 88
  • Ensembl Variation 88
    • Region filter performance improvement
  • Ensembl Regulation 88
    • Region filter performance improvement
  • Vega 68

Minor change to MySQL dump format (all species)

There will be a very minor change to how we generate the .sql files provided as part of our FTP dumps for release 88 onwards. The change is that the SQL will be produced directly as a single file from mysqldump -d, which should make the SQL file easier to generate and use. This has been in place for Ensembl Genomes for a number of years without issue.

As an example, you can compare the current format:

ftp://ftp.ensembl.org/pub/release-87/mysql/saccharomyces_cerevisiae_core_87_4/saccharomyces_cerevisiae_core_87_4.sql.gz

with the new format:

ftp://ftp.ensemblgenomes.org/pub/release-34/fungi/mysql/saccharomyces_cerevisiae_core_34_87_4/saccharomyces_cerevisiae_core_34_87_4.sql.gz

Variation

COSMIC data update (Human)

Imported cancer data from COSMIC version 79.

This import excludes COSMIC alleles, populations and mutation types.

Structural variants (Human, Pig, Sheep)

  • Added new studies from DGVa
  • Updated some of the existing studies from DGVa

HGMD-Public dataset (Human)

HGMD data will be updated to version 2016.4 (December 2016)

Phenotype data updates (all species)

  • Updated Human phenotype data from different sources including NHGRI-EBI GWAS, OMIM, ClinVar, UniProt, Cosmic Gene Census, DDG2P, MIM Morbid and Orphanet.
  • OMIA data for several species
  • AnimalQTL data for several species
  • RGD data for Rat
  • ZFIN data for Zebrafish
  • IMPC data for Mouse
  • MGI data for Mouse

Update LD Rest endpoints (all species)

The ld/id and ld/region endpoints have been updated: the population name is now a required parameter for both.
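As a rough illustration of what a required population name means for client code, the sketch below builds request URLs for the two endpoints and refuses to do so without a population. The server address, the path layout (population name as the final path segment) and the example identifiers are assumptions and should be checked against the REST documentation; the helper names are hypothetical.

```python
from urllib.parse import quote

# Assumption: public REST server address.
REST_SERVER = "https://rest.ensembl.org"


def _require_population(population_name):
    """Fail early, since the endpoints no longer accept a missing population."""
    if not population_name:
        raise ValueError("population_name is required for the LD endpoints")
    # Population names contain colons, which are kept as-is here.
    return quote(population_name, safe=":")


def ld_id_url(species, variant_id, population_name):
    """URL for LD values around one variant (assumed path: ld/species/id/population)."""
    population = _require_population(population_name)
    return "/".join([REST_SERVER, "ld", quote(species), quote(variant_id), population])


def ld_region_url(species, region, population_name):
    """URL for LD values in a region (assumed path: ld/species/region/region/population)."""
    population = _require_population(population_name)
    return "/".join(
        [REST_SERVER, "ld", quote(species), "region", quote(region, safe=":"), population]
    )


if __name__ == "__main__":
    # Example values are illustrative only.
    print(ld_id_url("homo_sapiens", "rs1042779", "1000GENOMES:phase_3:KHV"))
    print(ld_region_url("homo_sapiens", "6:25837556..25843455", "1000GENOMES:phase_3:KHV"))
```

Scripts that previously relied on a default population would need to pass one explicitly.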

VEP switching to ensembl-vep (all species)

The officially supported VEP repository will move from ensembl-tools to ensembl-vep.

The ensembl-tools version will remain available for one release, and after that be available only on archive branches.

New REST API phenotype endpoint (all species)

A new REST API endpoint retrieves the phenotype associations overlapping a defined region.

Future Plans

Read about our future plans on our blog!