News for Human Ensembl Release 80 (May 2015)

News categories

New web displays and tools

New export options for comparative views

As part of our ongoing upgrade of the export interface, Release 80 includes the following new features:

Homologues

Sequence export has been added for both orthologues and paralogues. Unaligned gene sequences can be exported in FASTA format, whilst multiple alignments can be exported in all our usual BioPerl formats (including FASTA); both types can be exported as DNA or amino acids.

Note that at the moment you cannot filter the paralogues by species, as that option is not available within the paralogue page itself.

Gene Trees

Previously, the only exports available from individual nodes of the tree were in FASTA and Newick formats. These links have been replaced with one that opens the export interface, so you can choose any of the formats offered by the full tree export.

OrthoXML filtering

Exports in orthoXML now honour the current page settings, allowing you to export only the homologues you see.

New styles for BigWig files on karyotype

Last release we added the ability to display your BigWig data on whole chromosomes or the entire karyotype, as line graphs or histograms. In release 80 we have added two new track styles for line graphs:

1 - "raw mean", i.e. mean values scaled relative to one another, instead of relative to the maximum value in that region (the default setting)

2 - "mean with whiskers", which combines the line graph with "whiskers" to display the minimum and maximum values within each bin.

To use either style, go to the track in the "Configure this page" screen, and choose from the style dropdown.
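
As a rough illustration of what the "mean with whiskers" style summarises, the sketch below splits a list of scores into fixed-size bins and reports the mean, minimum and maximum of each bin. The function name `summarise_bins` and the fixed bin size are assumptions made for this example; they are not taken from the Ensembl drawing code.

```python
from statistics import mean


def summarise_bins(values, bin_size):
    """Return one (mean, minimum, maximum) tuple per consecutive bin.

    Hypothetical helper: the real track rendering may bin data differently.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    summaries = []
    for start in range(0, len(values), bin_size):
        chunk = values[start:start + bin_size]
        # The mean gives the line graph; min and max give the whiskers.
        summaries.append((mean(chunk), min(chunk), max(chunk)))
    return summaries


scores = [1.0, 3.0, 2.0, 6.0, 0.0, 4.0]
bins = summarise_bins(scores, 2)
for bin_mean, bin_min, bin_max in bins:
    print(f"mean={bin_mean} whiskers={bin_min}..{bin_max}")
```

The line graph would be drawn through the means, with a whisker spanning the minimum and maximum of each bin.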

New userdata track type: long-range interactions

We are very pleased to announce that Ensembl now supports long-range pairwise interaction data, which can be drawn as arcs on Region in Detail. Scores are indicated using a grey-to-black gradient, and labels can be displayed by selecting the appropriate track style from the configuration menu.

Initially we are supporting the two formats developed by WashU for their Epigenomics browser: a simple text-based format which can be pasted into the form or uploaded from your computer, and a richer tabix-indexed format that must be attached via a URL. More information on both formats can be found in our online documentation.

We hope to support more formats in the future, so please let us know which formats you are currently using!

New species, assemblies and genebuilds

Human: RefSeq-to-Ensembl model comparison attributes

For each refseq_import transcript model present in the human otherfeatures db, a comparison is carried out with all overlapping Ensembl transcript models from the core db.

Initially the models are compared on the whole transcript level: all exons are compared in terms of genomic coordinates, and the transcript sequences of the two models are also compared.

For non-coding models, if both of these comparisons match then the models are considered to match on the whole transcript level and the RefSeq model is given an attribute to say there is a match on the whole transcript level. If no overlapping Ensembl model meets the criteria the RefSeq model is given a transcript attribute to denote this.

For models where a CDS is defined there is an extra level of comparison. The coding exon coordinates, CDS and translation sequences of both models are also compared.

If all exon coordinates (coding and non-coding) and the transcript, CDS and translation sequences match, then the RefSeq model is given an attribute to say there is a match on the whole transcript level.

Failing this, a comparison is done on the coding exon coordinates, CDS and translation sequences only. If these comparisons match, the RefSeq transcript is given an attribute to denote that there is a match on the CDS level only.

If there are still no matching Ensembl transcripts at this point the RefSeq transcript is given an attribute to denote that there is no matching Ensembl model.

All matching Ensembl models have their stable ids listed in the value field of the corresponding transcript attribute for the RefSeq model.
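
The comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the genebuild pipeline code: the `Model` class, the function names and the result labels (`whole_transcript_match`, `cds_only_match`, `no_match`) are hypothetical, and the stable ids in the usage example are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end) in genomic coordinates


@dataclass
class Model:
    """Hypothetical stand-in for a transcript model."""
    stable_id: str
    exons: List[Exon]
    transcript_seq: str
    coding_exons: Optional[List[Exon]] = None  # None when no CDS is defined
    cds_seq: Optional[str] = None
    translation_seq: Optional[str] = None


def cds_match(refseq: Model, ensembl: Model) -> bool:
    """Coding exon coordinates, CDS and translation sequences all agree."""
    return (refseq.coding_exons == ensembl.coding_exons
            and refseq.cds_seq == ensembl.cds_seq
            and refseq.translation_seq == ensembl.translation_seq)


def whole_transcript_match(refseq: Model, ensembl: Model) -> bool:
    """All exon coordinates and the transcript sequence agree; for coding
    models the CDS-level comparison must agree as well."""
    same = (refseq.exons == ensembl.exons
            and refseq.transcript_seq == ensembl.transcript_seq)
    if refseq.coding_exons is None:
        return same
    return same and cds_match(refseq, ensembl)


def classify(refseq: Model, overlapping: List[Model]):
    """Return (label, matching stable ids) for one RefSeq model."""
    whole = [m.stable_id for m in overlapping
             if whole_transcript_match(refseq, m)]
    if whole:
        return ("whole_transcript_match", whole)
    if refseq.coding_exons is not None:
        cds_only = [m.stable_id for m in overlapping if cds_match(refseq, m)]
        if cds_only:
            return ("cds_only_match", cds_only)
    return ("no_match", [])


# Placeholder models: same CDS, but the first exon differs in the 5' UTR.
refseq = Model("REFSEQ_PLACEHOLDER", [(100, 200), (300, 400)], "ACGT",
               [(150, 200), (300, 350)], "ATG", "M")
ens_a = Model("ENST_PLACEHOLDER_A", [(90, 200), (300, 400)], "AACGT",
              [(150, 200), (300, 350)], "ATG", "M")
ens_b = Model("ENST_PLACEHOLDER_B", [(100, 200), (300, 420)], "ACGTT",
              [(150, 200), (300, 360)], "ATGA", "MK")
result = classify(refseq, [ens_a, ens_b])
print(result)
```

In this sketch the list of matching stable ids plays the role of the value field of the transcript attribute.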

Vega Zebrafish annotation updated

Manual annotation of zebrafish from Havana has been updated and contains the data released in Vega 59.

New variation data

1000 Genomes Phase 3

Genotypes from 1000 Genomes Phase 3 will be available (this replaces the phase 1 data).

dbSNP142 import for human

The human GRCh38 variation database will be updated to dbSNP142.

API and schema changes

start_lost to replace initiator_codon_variant consequence type

We will replace the use of initiator_codon_variant with the more specific start_lost. The difference between the two is largely semantic.

The new term protein_altering_variant will be used for variants within the protein which are not better described by any of its child terms.

API: new method to get the multiple alignment of several homologues

The new method is GeneTree::get_alignment_of_homologues($ref_member)

Other updates

Compara

Schema version update

79 -> 80

API: new methods to fetch data from a Gene / Transcript / DnaFrag

To improve the usability of our API, we'll add methods to fetch the data directly from the Core objects, without having to create Members and DnaFrags. Members and DnaFrags will still be used to represent the data on our side, though.

D.rer GRCz10 pairwise alignments and syntenies

  • lastz D.rer vs T.rub (on D.rer)
  • lastz D.rer vs T.nig (on D.rer; used to be on T.nig)
  • lastz D.rer vs G.acu (on D.rer; used to be on G.acu)
  • lastz D.rer vs O.lat (on D.rer)
  • lastz D.rer vs O.nil (on D.rer)
  • lastz D.rer vs X.mac (on D.rer)
  • lastz D.rer vs L.ocu (on D.rer)
  • lastz D.rer vs G.mor (on D.rer)
  • lastz D.rer vs A.mex (on D.rer)
  • lastz D.rer vs P.for (on D.rer)
  • lastz D.rer vs G.gal (on G.gal)
  • lastz D.rer vs H.sap (on H.sap)
  • lastz D.rer vs M.mus (on M.mus)
  • lastz D.rer vs X.tro (on D.rer)
  • lastz D.rer vs L.cha (on D.rer)
  • lastz D.rer vs P.mar (on D.rer)
  • lastz D.rer vs C.sav (on D.rer)
  • lastz D.rer vs C.int (on D.rer)

Synteny maps will be generated when both species have their karyotype stored in the database.

R.nor Rnor_v6.0 pairwise alignments and syntenies

  • lastz R.nor vs M.mus (on M.mus) + synteny
  • lastz R.nor vs H.sap (on H.sap) + synteny

D.rer GRCz10 multiple alignments

  • 5-way fish EPO alignments
  • 11-way fish EPO-2X alignments

R.nor Rnor_v6.0 multiple alignments

  • 17-way eutherian EPO alignments
  • 39-way eutherian EPO-2X alignments
  • 23-way amniota MercatorPecan alignments

We will also regenerate the "Age Of Base" human track from the new 17-way EPO MSA.

Core

External database references update

Xrefs update for:

danio_rerio (zebrafish), homo_sapiens (human), rattus_norvegicus (rat), mus_musculus (mouse), gasterosteus_aculeatus (stickleback), latimeria_chalumnae (coelacanth), ciona_savignyi, anas_platyrhynchos (duck), tupaia_belangeri (tree shrew), erinaceus_europaeus (hedgehog), echinops_telfairi (tenrec), ictidomys_tridecemlineatus (squirrel), oryctolagus_cuniculus (rabbit), pelodiscus_sinensis (softshell turtle), ficedula_albicollis (flycatcher), papio_anubis (olive baboon)

LRG Import

Import of the latest version of the Locus Reference Genomic (LRG) dataset.

Ensembl VM Build

The Ensembl Virtual Machine appliance will be updated to version 80.

patch_79_80_a.sql - schema_version update

Update schema_version in meta table to 80.

Stable ID lookup

Stable ID lookup provided for REST services

Includes lookup for RefSeq and CCDS entries
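
As an illustration of a stable ID lookup over REST, the sketch below builds a request against the public Ensembl REST server's `lookup/id` endpoint. The server address `https://rest.ensembl.org`, the helper names `lookup_url` and `lookup`, and the choice of example gene are assumptions of this example; whether a given RefSeq or CCDS accession resolves through this endpoint is not confirmed here.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed address of the public REST server.
REST_SERVER = "https://rest.ensembl.org"


def lookup_url(stable_id, server=REST_SERVER):
    """Build the lookup URL for one stable ID (hypothetical helper)."""
    return f"{server}/lookup/id/{quote(stable_id, safe='')}"


def lookup(stable_id, server=REST_SERVER, timeout=30):
    """Fetch the lookup record as a dict. Requires network access."""
    request = Request(lookup_url(stable_id, server),
                      headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Only the URL is built here; call lookup() to perform the request.
url = lookup_url("ENSG00000174485")
print(url)
```

Calling `lookup("ENSG00000174485")` would then return the parsed JSON record, provided the server is reachable.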

patch_79_80a.sql - schema_version update in production db

Update schema_version in production database to 80.

patch_79_80_a.sql - schema_version update in ontology db

Update schema_version in meta table to 80.

patch_79_80_b.sql

increase length of dbprimary_acc column in xref table

patch_79_80_c.sql

increase length of synonym column in seq_region_synonym table

patch_79_80_d.sql

increase length of value column in genome_statistics table

Regulation

patch_79_80_c - stable_id changed to varchar

The regulatory_feature.stable_id field was changed from an int to a varchar. API support was implemented to handle this.

Micro Array Mapping

Microarray updates

Microarray mapping was carried out for species with updated gene builds:

  • Rat
  • Zebrafish

Added the missing transcript annotations for:

  • HuEx-1_0-st-v2
  • HuGene-1_0-st-v1
  • HuGene-2_0-st-v1

Added new Human Affymetrix array:

  • HTA-2_0 

Human Segmentation adjacent feature merge

Adjacent segmentation features with the same segmentation classification were merged into a single feature.
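
A minimal sketch of this kind of merge is shown below. It assumes features on a single sequence region, sorted by start, with inclusive coordinates, and it treats two features as adjacent when the second starts one base after the first ends. The function name `merge_adjacent`, the tuple layout and the class labels in the usage example are hypothetical.

```python
def merge_adjacent(features):
    """Merge neighbouring features that share a classification.

    features: list of (start, end, classification) tuples, sorted by start,
    all on the same sequence region, with inclusive coordinates (assumed).
    """
    merged = []
    for start, end, label in features:
        if merged:
            prev_start, prev_end, prev_label = merged[-1]
            # Adjacent means the new feature begins right after the last one.
            if prev_label == label and prev_end + 1 == start:
                merged[-1] = (prev_start, end, label)
                continue
        merged.append((start, end, label))
    return merged


# Placeholder class labels, for illustration only.
segments = [
    (1, 200, "class_a"),
    (201, 400, "class_a"),
    (401, 600, "class_b"),
    (601, 800, "class_a"),
]
merged_segments = merge_adjacent(segments)
print(merged_segments)
```

Features with the same classification that are separated by a differently classified feature, or by a gap, stay separate in this sketch.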

patch_79_80_b dbfile_registry unique key

A unique key patch was applied to the dbfile_registry table.

BindingMatrix

A matrix method was added to BindingMatrix to store the matrix array. This will be used to generate the frequency string.

patch_79_80_a.sql - schema_version update

Genebuild

Human: updated cDNA alignments

A new cdna database will be created for e80: The latest set of cDNAs for human from the European Nucleotide Archive and NCBI RefSeq will be aligned to the current genome using Exonerate.

Human: updated RefSeq gene import

The imported RefSeq gene set was updated in the human otherfeatures database. Please note that RefSeq annotates gene models on cDNA sequence and not on the reference genome, so when RefSeq transcripts are translated from the reference genome, the translations may contain stop codons.

Human: Refseq-genomic-to-mRNA comparison attributes

Transcript attributes will be added for the refseq_import geneset in the human otherfeatures db. Each refseq_import transcript will have an attribute to denote whether the genomic sequence that the transcript covers matches the mRNA sequence that the transcript is based on (the sequences present in the RefSeq mRNA file).

A perfect match is defined as an alignment across the entirety of both sequences that contains no mismatches or indels. If initially there is a mismatch, the RefSeq mRNA will go through polyA clipping and the sequences will be compared again to see if a perfect match is possible after polyA clipping.

Transcripts that do not have a perfect match between the mRNA and the genomic sequence will get additional attributes to define which regions (5' UTR, CDS, 3' UTR, or 'whole transcript' if there is no CDS defined) do not align perfectly, along with a summary of the information in the alignment (match count, mismatch count, indel count, total indel length).
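
The two-stage check (direct comparison, then comparison after polyA clipping) can be sketched as below. This is a simplified illustration under stated assumptions: sequences are plain strings, a perfect match is plain string equality, and clipping removes a trailing run of A bases from the mRNA. The function name and result labels are hypothetical, and the real pipeline's clipping rules may differ.

```python
def compare_genomic_to_mrna(genomic, mrna):
    """Classify one transcript's genomic sequence against its mRNA.

    Returns one of three hypothetical labels:
    'perfect_match', 'perfect_match_after_polya_clipping', 'no_perfect_match'.
    """
    genomic = genomic.upper()
    mrna = mrna.upper()
    if genomic == mrna:
        return "perfect_match"
    # Is there a trailing run of A bases whose removal from the mRNA
    # leaves exactly the genomic sequence?
    tail = mrna[len(genomic):]
    if mrna.startswith(genomic) and tail and set(tail) == {"A"}:
        return "perfect_match_after_polya_clipping"
    return "no_perfect_match"


examples = {
    "identical": compare_genomic_to_mrna("ATGGCCTGA", "ATGGCCTGA"),
    "polya_tail": compare_genomic_to_mrna("ATGGCCTGA", "ATGGCCTGAAAAAA"),
    "mismatch": compare_genomic_to_mrna("ATGGCCTGA", "ATGGCGTGAAAAA"),
}
for name, label in examples.items():
    print(name, label)
```

Checking the tail explicitly, rather than stripping every trailing A, keeps the sketch correct when the genomic sequence itself ends in A.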

Production

EMBL and Genbank Dumps

EMBL and Genbank dumps for all species.

Ensembl 80 mart databases

  • Ensembl Genes 80
    • Added Protein domains start and end attributes (protein-based coordinate)
    • Renamed "ENCODE region" filter to "ENCODE Pilot Regions", added a link to the publication (http://www.genome.gov/26525202)
    • Renamed the following Uniprot filters and attributes
      • "UniProt/Swissprot ID" to "UniProt/Swissprot Accession"
      • "UniProt/TrEMBL ID" to "UniProt/TrEMBL Accession"
      • "UniProt Genename ID" to "UniProt Gene Name"
      • "Uniprot Genename Transcript Name" to "Uniprot Transcript Name"
    • Updated rat (Rnor_6.0) and zebrafish (GRCz10) assemblies
    • New Phenotype source filter in the "Phenotype" filter section
    • Renamed "APPRIS principal isoform annotation" filter and attributes to "APPRIS annotation"
  • Ensembl Variation 80
    • New Phenotype source filter in the "General variation" filter section
    • Updated rat (Rnor_6.0) and zebrafish (GRCz10) assemblies
  • Ensembl Regulation 80
  • Vega 60
    • Updated rat (Rnor_6.0) and zebrafish (GRCz10) assemblies
    • Added Protein domains start and end attributes (protein-based coordinate)

External reference projection

Gene ontology (GO) identifiers and gene name projection to all species.

FASTA & GTF dumps

FASTA & GTF dumps for all the species

New Ensembl BioMart documentation

We now have brand new Ensembl BioMart documentation. We have re-organised and updated it, and added the following new pages:

  • Combining multiple species datasets
  • BiomaRt, Bioconductor R package
  • BioMart perl API
  • BioMart RESTful access (Perl and wget)

You can find the new documentation on the following page: http://www.ensembl.org/info/data/biomart/index.html

Variation

HGMD data update

Import of the latest release of public HGMD data (version 2014.4) and remapping to GRCh38

Phenotype data updates

  • Human phenotype data will be updated from different sources including NHGRI-EBI GWAS, OMIM, ClinVar, UniProt, DDG2P and Decipher.
  • OMIA data for Cow, Dog, Zebrafish, Horse, Cat, Chicken, Macaque, Turkey, Sheep and Chimpanzee
  • RGD data for Rat
  • AnimalQTL for Cow, Horse, Chicken, Pig
  • ZFIN for Zebrafish
  • EuroPhenome, 3i, IMPC, MGP for Mouse

Structural variations

  • Added new studies and updated other studies from DGVa.
  • New human study for 1000 Genomes - phase 3

Personal Genomes Data

Data from the Personal Genomes project will no longer be imported

Update ESP data GRCh38

Updated Exome Sequencing Project (ESP) data for human GRCh38 to v.0.0.30 (Nov. 3, 2014).

Web

Deprecation of Sanger::Graphics

To improve long-term maintainability of the Perl GD drawing code, in release 79 we moved all necessary functionality from the Sanger::Graphics namespace into EnsEMBL::Draw. All Sanger::Graphics modules have been deprecated as of release 80, and they will be removed in release 82.

This change should only affect developers whose code calls methods directly from Sanger::Graphics instead of inheriting from EnsEMBL::Draw modules.

Note that a few helper modules, including ColourMap, have been moved into EnsEMBL::Draw::Utils.

New web dependency - ensembl-io

Starting with release 80, the Ensembl webcode will have an additional dependency: ensembl-io, our Git repository for file-parsing code. This can be checked out from GitHub the same way as our other repos, and is also included in the "web" group of repos used by ensembl-git-tools, so that the command

git ensembl --clone web

will automatically clone ensembl-io in addition to existing web code.

Initially this dependency will only affect variation data, but the plan is to integrate the new parsers into other areas of the website such as user uploads. Deprecation of the old parser modules will be announced in due course.

Gene Expression Atlas Widget

The Gene Expression Atlas widget has been embedded in Ensembl. You can now view where a gene is expressed anatomically (where exactly in the species) and which experiments it is associated with.

www.ensembl.org/Homo_sapiens/Gene/ExpressionAtlas?g=ENSG00000174485;r=15:65658046-65792293 

The code has been added to the widgets plugin.

GlyphSets at risk

The following GlyphSets are unused in the core Ensembl webcode. Third-party integrators should be aware that they are at risk of deletion in a future release. If you have any use for the following GlyphSets, please contact the Ensembl team.

_text, ctcf, fg_wiggle, GlyphSet_feature, histone_modifications, ld2, lsv_variations, missing, P_protdas, P_separator, preliminary, restrict, simple_histogram, tsv_missing, urlfeature, Vrefseqs

Tools (BLAST & BLAT) gap initiation update

Gap initiation and extension penalties have been updated to reflect the correct order, which is dependent on the matrix (BLOSUM and PAM). The same has been done for BLASTN, where the order is dependent on the match and mismatch scores.

Download the ncRNA secondary-structure as SVG

You can now download the ncRNA secondary structure view as SVG. A link has been added at the top of the view.

Track label improvements for images

Some tracks in images now appear within sections, grouping common tracks within a category.

Each section is identified by a heading underlined in a certain colour, and each track within that section by the same colour being used on its left-hand side.

Also, some tracks now have labels within the image itself, to allow particularly long labels. These in-image labels can be configured on or off via the configuration panel.

Initially these features are primarily targeted at trackhub support, but will be increasingly used for other tracks as the opportunity is identified.