News for Human Ensembl Release 88 (March 2017)

News categories

New web displays and tools

Masthead redesign

As we begin to incorporate strains and other non-reference assemblies into Ensembl, we want to display more information about a "species" on all relevant pages so that users are aware of the differences.

We have therefore rearranged the top of our browser pages, removing the species "tab" and putting the information in the dark blue area above the tabs where there is more space. The dropdown list of species is still accessible via the triangle icon, and all other functionality is unaffected.

Find a Data Display

Ensembl has lots of different pages where you can access data or see visualisations of that data. Now, to help you find the display you need, we have introduced our "Find a data display" page. Just type a gene identifier, location or RSID into the form to see a list of relevant views, with thumbnails of sample data and links to that page for your feature or region of choice.

The page can be accessed from the central panel on the home page or via the 'Help and Documentation' link in the page header; in the latter case, the link is near the top of the left-hand menu.

New species, assemblies and genebuilds

Update to Ensembl-Havana human GENCODE gene set (release 26)

Updated Ensembl-Havana gene set (GENCODE release 26). This gene set is a merge of complete Ensembl gene models and the latest Havana gene annotation. All CCDS genes are included in this gene set.

The human GRCh38.p10 gene annotation is also included:

The patches for GRCh38.p10 were annotated using a combination of manual annotation, annotation projected from the primary assembly and annotation derived from cDNA and protein alignment evidence. Annotation of the patches is stored in the core database.

Vega Human annotation updated

The manual annotation of human from Havana has been updated and contains the data released in Vega 68.

New variation data

New dbSNP data for Human

Human is updated with the latest version of dbSNP (149)

PolyPhen version update

The version of PolyPhen run in Ensembl will be updated to 2.2.2r405c

Other updates

Compara

Schema: new dnafrag.codon_table_id column

Indicates which codon table should be used to translate sequences of this dnafrag. The information is a copy of what is stored in the Core database, but is needed to 1) make the dN/dS computation more efficient and 2) handle alternative codon tables in the absence of a core database.
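To illustrate why the codon table matters, here is a minimal sketch in Python (not Compara code) that translates the same sequence with two NCBI genetic codes. The function name `translate` and the dictionary `CODON_TABLES` are hypothetical. Only tables 1 (standard) and 2 (vertebrate mitochondrial) are included, and the sketch assumes the stored value is the NCBI genetic-code number.

```python
# NCBI-style compact codon tables: one amino-acid letter per codon, with
# codons ordered by the bases T, C, A, G at each of the three positions.
BASES = "TCAG"
CODON_TABLES = {
    # 1 = standard code
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    # 2 = vertebrate mitochondrial code
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
}


def translate(cds: str, codon_table_id: int = 1) -> str:
    """Translate a coding sequence using the given codon table.

    Trailing bases that do not form a full codon are ignored, and codons
    containing anything other than T, C, A or G are reported as 'X'.
    """
    table = CODON_TABLES[codon_table_id]
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3].upper()
        if any(base not in BASES for base in codon):
            protein.append("X")
            continue
        index = (16 * BASES.index(codon[0])
                 + 4 * BASES.index(codon[1])
                 + BASES.index(codon[2]))
        protein.append(table[index])
    return "".join(protein)


print(translate("ATGTGAAGA", 1))  # M*R
print(translate("ATGTGAAGA", 2))  # MW*
```

The same nine bases give different peptides: TGA is a stop codon in the standard code but tryptophan in the vertebrate mitochondrial code, while AGA is arginine in the former and a stop in the latter.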

Schema: new exon_boundaries table

Used to keep track of all the exon coordinates

Schema: new gene_member.biotype_group column

Indicates whether the gene is protein-coding, a short ncRNA, etc. This allows all the genes to be loaded in one operation and lets the homology pipelines filter their dataset using the compara database only, instead of querying the core database.

Schema: new genome_db.strain_name column

Used to indicate the name of this strain (complements taxon_id which only provides the species name)

Schema: new seq_member.has_translation_edits and seq_member.has_transcript_edits columns

Used to flag the seq_members that have hardcoded transcript / protein sequences. When this happens, the data (exon coordinates + transcript sequence + translation sequence) are not in sync and some analyses have to be discarded.

patch_87_88_a.sql - Schema version update

87 -> 88

H.sap alignments

We will top up all LastZ alignments for human vs all target species that have a karyotype.

Family REST endpoints

Addition of family REST endpoints

Pruned EPO alignments

Method to retrieve EPO alignments for a given species subset.

Cafe Tree REST endpoints

Addition of Cafe Tree REST endpoints.

Protein Families

Updated HMM families including all Ensembl transcript isoforms (including human non-reference haplotypes) and newest Uniprot Metazoa.

 -- Clustering by PantherScore (based on Ensembl HMM library)
 -- Multiple Sequence Alignments with MAFFT (v.7.221)

ProteinTrees and homologies

GeneTrees (protein-coding) with new/updated genebuilds and assemblies

 -- all-vs-all blastp (ncbi-blast-2.2.30+)
 -- Clustering using hcluster_sg
 -- Multiple sequence alignments using MCoffee (Version_9.03.r1318) or Mafft (mafft-7.221)
 -- Phylogenetic reconstruction using TreeBeST
 -- Homology inference
 -- Pairwise gene-based dN/dS scores for high coverage species pairs only (both on orthologues and paralogues) (codeml/PAML v4.3)
 -- GeneTree stable ID mapping
 -- Per family gene dynamics using CAFE (v2.2)
 -- Computation of pairwise gene-order conservation score
 -- Comparison of orthologies with whole-genome alignments
 -- High-confidence calls

ncRNAtrees and homologies

  • Classification based on Rfam models (v12.1)
  • Multiple sequence alignments with Infernal
  • Phylogenetic reconstruction using RAxML
  • Phylogenetic reconstruction using FastTree2 and ExaML for very big families
  • Additional multiple sequence alignments with Prank (w/ genomic flanks)
  • Additional phylogenetic reconstruction using PhyML and NJ
  • Phylogenetic tree merging using TreeBeST
  • Per family gene dynamics using CAFE
  • Homology inference
  • Secondary structure plots

Core

External database references update

Xref updates for: alpaca, dolphin, fugu, gibbon, guinea pig, human, kangaroo rat, marmoset, mouse, orangutan, panda, platyfish, rat, sloth, tarsier, Tasmanian devil and xenopus.

patch_87_88_b.sql

Name column in seq_region table expanded to varchar(255)

Regulation

GTEx Update

Update to GTEx v6 used by the eQTL REST endpoint

Database schema changes

patch_87_88_b.sql - Allow seq_region name to be longer

patch_87_88_c.sql - sample_regulatory_feature_id field for regulatory build

Genebuild

Updated human otherfeatures db: New CCDS import

This release of the human gene set also includes 32,514 transcript models as part of an updated version (November 2016) of CCDS

Human: updated cDNA alignments

A new cDNA database will be created for release 88. The latest set of cDNAs for human (as of December 2016) from the European Nucleotide Archive and NCBI RefSeq will be aligned to the current genome using Exonerate.

Human and mouse: Ensembl-to-RefSeq comparison attributes

For each Ensembl transcript present in the human and mouse core db, a comparison is carried out with all overlapping RefSeq transcripts from the otherfeatures db.

Up to five comparisons are carried out (depending on whether the models are coding or non-coding):

1) Check if all exon coordinates match (all transcripts)

2) Check if transcript sequences match (all transcripts)

3) Check if the CDS exon coordinates match (coding transcripts only)

4) Check if the CDS sequences match (coding transcripts only)

5) Check if the translation sequences match (coding transcripts only)

For non-coding models, if comparisons (1) and (2) are a match then the transcripts are considered to match on the whole transcript level and the Ensembl transcript is given an attribute to say there is a match on the whole transcript level.

For coding models, if all five comparisons are true then the Ensembl transcript is given an attribute to say there is a match on the whole transcript level. Failing that, if comparisons (3), (4) and (5) are true the Ensembl transcript is given an attribute to say there is a match on the CDS level.

The stable ids of any matching RefSeq transcripts will be stored in the value field of the Ensembl transcript attribute.
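The decision rules above can be sketched as a small Python function. This is an illustration only, not the Genebuild implementation. The function name and the returned labels (`"whole_transcript"`, `"cds"`) are hypothetical. The sketch also assumes that the fallback case, where only comparisons (3), (4) and (5) hold, is recorded as a CDS-level match.

```python
from typing import Optional


def refseq_match_level(coding: bool,
                       exons_match: bool,
                       transcript_seq_match: bool,
                       cds_exons_match: bool = False,
                       cds_seq_match: bool = False,
                       translation_match: bool = False) -> Optional[str]:
    """Return the (hypothetical) match label for one Ensembl/RefSeq pair.

    Comparisons (1) and (2) apply to all transcripts; comparisons
    (3), (4) and (5) are only evaluated for coding transcripts.
    """
    whole = exons_match and transcript_seq_match
    if not coding:
        # Non-coding: comparisons (1) and (2) must both match.
        return "whole_transcript" if whole else None
    cds = cds_exons_match and cds_seq_match and translation_match
    if whole and cds:
        # Coding: all five comparisons match.
        return "whole_transcript"
    if cds:
        # Assumed fallback: only comparisons (3), (4) and (5) match.
        return "cds"
    return None
```

For example, a coding transcript whose UTR exons differ from the RefSeq model but whose CDS coordinates, CDS sequence and translation agree would return `"cds"` under these assumptions.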

Production

Ensembl 88 mart databases

  • Ensembl Genes 88
    • Region filter performance improvement
    • Renamed some filter/attributes internal and display names
    • Renamed attribute "% GC content" to "Gene % GC content" 
  • Mouse Genes 88
  • Ensembl Variation 88
    • Region filter performance improvement
  • Ensembl Regulation 88
    • Region filter performance improvement
  • Vega 68

Minor change to MySQL dump format

There will be a very minor change to how we generate the .sql files provided as part of our FTP dumps for release 88 onwards. The change is that the SQL will be produced directly as a single file from mysqldump -d, which should make the SQL file easier to generate and use. This has been in place for Ensembl Genomes for a number of years without issue.

As an example, you can compare the current format:

ftp://ftp.ensembl.org/pub/release-87/mysql/saccharomyces_cerevisiae_core_87_4/saccharomyces_cerevisiae_core_87_4.sql.gz

with the new format:

ftp://ftp.ensemblgenomes.org/pub/release-34/fungi/mysql/saccharomyces_cerevisiae_core_34_87_4/saccharomyces_cerevisiae_core_34_87_4.sql.gz

Variation

COSMIC data update

Imported cancer data from COSMIC version 79.

This import excludes the COSMIC alleles, populations and mutation types.

Structural variants

  • Added new studies from DGVa
  • Updated some of the existing studies from DGVa

HGMD-Public dataset

HGMD data will be updated to version 2016.4 (December 2016)

Phenotype data updates

  • Updated Human phenotype data from different sources including NHGRI-EBI GWAS, OMIM, ClinVar, UniProt, Cosmic Gene Census, DDG2P, MIM Morbid and Orphanet.
  • OMIA data for several species
  • AnimalQTL data for several species
  • RGD data for Rat
  • ZFIN data for Zebrafish
  • IMPC data for Mouse
  • MGI data for Mouse

Update LD REST endpoints

The ld/id and ld/region endpoints have been updated: the population name is now a required parameter for both.
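As a rough illustration of what a required population name means for client code, the Python sketch below builds request URLs and rejects a missing population. The helper names are hypothetical. The server address and the path layouts (`ld/<species>/<id>/<population>` and `ld/<species>/region/<region>/<population>`) are assumptions and should be checked against the REST documentation for this release.

```python
from urllib.parse import quote

REST_SERVER = "https://rest.ensembl.org"  # assumed public REST server


def _require_population(population_name: str) -> str:
    """Reject an empty population name and URL-quote a valid one."""
    if not population_name:
        raise ValueError("population_name is required for LD endpoints")
    # Keep ':' unescaped, since population names contain colons.
    return quote(population_name, safe=":")


def ld_id_url(species: str, variant_id: str, population_name: str) -> str:
    """Build an ld/id URL (path layout is an assumption)."""
    population = _require_population(population_name)
    return f"{REST_SERVER}/ld/{quote(species)}/{quote(variant_id)}/{population}"


def ld_region_url(species: str, region: str, population_name: str) -> str:
    """Build an ld/region URL (path layout is an assumption)."""
    population = _require_population(population_name)
    return (f"{REST_SERVER}/ld/{quote(species)}/region/"
            f"{quote(region, safe=':')}/{population}")
```

Calling `ld_id_url("human", "rs1042779", "")` raises `ValueError` before any request is sent.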

VEP switching to ensembl-vep

The officially supported VEP repository will move from ensembl-tools to ensembl-vep.

The ensembl-tools version will remain available for one release, and after that be available only on archive branches.

New REST API phenotype endpoint

A new REST API endpoint has been created to retrieve the phenotype associations overlapping a defined region.
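A minimal Python sketch of how a client might address such an endpoint is shown below. The function name, the server address, the path layout (`phenotype/region/<species>/<region>`) and the `chr:start-end` region format are assumptions for illustration, not details taken from this announcement.

```python
import re

REST_SERVER = "https://rest.ensembl.org"  # assumed public REST server

# Assumed region format: <sequence name>:<start>-<end>, e.g. 9:22125500-22136000
REGION_RE = re.compile(r"^[\w.]+:\d+-\d+$")


def phenotype_region_url(species: str, region: str) -> str:
    """Build a URL for phenotype associations overlapping a region."""
    if not REGION_RE.match(region):
        raise ValueError(f"region must look like 'chr:start-end', got {region!r}")
    name, coords = region.split(":")
    start, end = (int(x) for x in coords.split("-"))
    if start > end:
        raise ValueError("region start must not exceed region end")
    return f"{REST_SERVER}/phenotype/region/{species}/{name}:{start}-{end}"
```

The resulting URL can be fetched with any HTTP client, requesting JSON via the `Content-Type: application/json` header.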
