News for Xenopus Ensembl Release 88 (March 2017)

News categories

New web displays and tools

Masthead redesign

As we begin to incorporate strains and other non-reference assemblies into Ensembl, we want to display more information about a "species" on all relevant pages so that users are aware of the differences.

We have therefore rearranged the top of our browser pages, removing the species "tab" and putting the information in the dark blue area above the tabs where there is more space. The dropdown list of species is still accessible via the triangle icon, and all other functionality is unaffected.

Find a Data Display

Ensembl has lots of different pages where you can access data or see visualisations of that data. To help you find the display you need, we have introduced our "Find a data display" page. Just type a gene identifier, location or rsID into the form to see a list of relevant views, each with a thumbnail of sample data and a link to that view for your feature or region of choice.

The page can be accessed from the central panel on the home page or via the 'Help and Documentation' link in the page header; in the latter case, the link is near the top of the left-hand menu.

Other updates

Compara

Schema: new dnafrag.codon_table_id column

Indicates which codon table should be used to translate the sequences of this dnafrag. The information is copied from the Core database, but it is needed to 1) make the dN/dS computation more efficient and 2) handle alternative codon tables in the absence of a core database.
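
To illustrate why the codon table matters, here is a minimal Python sketch that translates the same sequence under two NCBI genetic codes. The function names `codon_map` and `translate` are hypothetical and not part of the Compara API, and only NCBI tables 1 (standard) and 2 (vertebrate mitochondrial) are included.

```python
from itertools import product

BASES = "TCAG"

# Amino acids in NCBI order: first, second and third base each cycle T, C, A, G.
# Only two tables are listed here; this is an illustrative subset.
TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
}


def codon_map(codon_table_id):
    """Build a codon -> amino acid dictionary for the given table id."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, TABLES[codon_table_id]))


def translate(seq, codon_table_id=1):
    """Translate a DNA sequence; trailing partial codons are ignored."""
    table = codon_map(codon_table_id)
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    # Codons containing anything other than A, C, G, T become "X".
    return "".join(table.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


print(translate("ATGTGAAGA", 1))  # M*R
print(translate("ATGTGAAGA", 2))  # MW*
```

The same nine bases give a different protein under each table, which is why a per-dnafrag table id is needed when no core database is available to supply it.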

Schema: new exon_boundaries table

Used to keep track of all the exon coordinates

Schema: new gene_member.biotype_group column

Indicates whether the gene is protein-coding, a short ncRNA, etc. This allows all the genes to be loaded in one operation and lets the homology pipelines filter their dataset using the compara database only, instead of querying the core database.
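
The kind of filtering this column enables can be sketched as follows. This is a toy example only: it uses an in-memory SQLite table as a stand-in for the MySQL compara database, the table is reduced to two columns, and both the gene identifiers and the group values ("coding", "snoncoding") are assumptions made for illustration.

```python
import sqlite3


def demo_connection():
    """Create a tiny stand-in for gene_member (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene_member (stable_id TEXT, biotype_group TEXT)"
    )
    conn.executemany(
        "INSERT INTO gene_member VALUES (?, ?)",
        [
            ("GENE_A", "coding"),
            ("GENE_B", "snoncoding"),
            ("GENE_C", "coding"),
        ],
    )
    return conn


def members_in_group(conn, group):
    """Return stable IDs of gene members in the given biotype group."""
    rows = conn.execute(
        "SELECT stable_id FROM gene_member "
        "WHERE biotype_group = ? ORDER BY stable_id",
        (group,),
    )
    return [row[0] for row in rows]


print(members_in_group(demo_connection(), "coding"))
```

A pipeline can select its input set with a single query like this, without opening a connection to each core database.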

Schema: new genome_db.strain_name column

Used to indicate the name of this strain (complements taxon_id which only provides the species name)

Schema: new seq_member.has_translation_edits and seq_member.has_transcript_edits columns

Used to flag the seq_members that have hardcoded transcript / protein sequences. When this happens, the data (exon coordinates + transcript sequence + translation sequence) are not in sync and some analyses have to be discarded.

patch_87_88_a.sql - Schema version update

87 -> 88

H.sap alignments

We will top up all LastZ alignments for human against every target species that has a karyotype.

Family REST endpoints

Addition of family REST endpoints
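
As a rough sketch of how such an endpoint might be called from Python, assuming the path `family/id/:id` on the public REST server (check the REST documentation for the endpoints actually available in your release). The helper names and the family identifier below are placeholders, and `fetch_family` needs network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

SERVER = "https://rest.ensembl.org"  # public Ensembl REST server


def family_url(family_id, server=SERVER):
    """Build the request URL (the endpoint path is an assumption)."""
    safe_id = quote(family_id, safe="")
    return f"{server}/family/id/{safe_id}?content-type=application/json"


def fetch_family(family_id, server=SERVER):
    """Fetch and decode the JSON response; requires network access."""
    with urlopen(family_url(family_id, server), timeout=30) as resp:
        return json.load(resp)


# "PTHR15573" is a placeholder family identifier.
print(family_url("PTHR15573"))
```

Keeping URL construction separate from the network call makes the request easy to inspect or log before it is sent.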

Pruned EPO alignments

Method to retrieve EPO alignments for a given species subset.

Cafe Tree REST endpoints

Addition of Cafe Tree REST endpoints.

Protein Families

Updated HMM families including all Ensembl transcript isoforms (including human non-reference haplotypes) and the newest UniProt Metazoa.

 -- Clustering by PantherScore (based on Ensembl HMM library)
 -- Multiple Sequence Alignments with MAFFT (v.7.221)

ProteinTrees and homologies

GeneTrees (protein-coding) with new/updated genebuilds and assemblies

 -- all-vs-all blastp (ncbi-blast-2.2.30+)
 -- Clustering using hcluster_sg
 -- Multiple sequence alignments using MCoffee (Version_9.03.r1318) or MAFFT (mafft-7.221)
 -- Phylogenetic reconstruction using TreeBeST
 -- Homology inference
 -- Pairwise gene-based dN/dS scores for high coverage species pairs only (both on orthologues and paralogues) (codeml/PAML v4.3)
 -- GeneTree stable ID mapping
 -- Per family gene dynamics using CAFE (v2.2)
 -- Computation of pairwise gene-order conservation score
 -- Comparison of orthologies with whole-genome alignments
 -- High-confidence calls

ncRNAtrees and homologies

  • Classification based on Rfam models (v12.1)
  • Multiple sequence alignments with Infernal
  • Phylogenetic reconstruction using RAxML
  • Phylogenetic reconstruction using FastTree2 and ExaML for very big families
  • Additional multiple sequence alignments with Prank (w/ genomic flanks)
  • Additional phylogenetic reconstruction using PhyML and NJ
  • Phylogenetic tree merging using TreeBeST
  • Per family gene dynamics using CAFE
  • Homology inference
  • Secondary structure plots

Core

External database references update

Xref updates for: alpaca, dolphin, fugu, gibbon, guinea pig, human, kangaroo rat, marmoset, mouse, orangutan, panda, platyfish, rat, sloth, tarsier, Tasmanian devil and xenopus.

patch_87_88_b.sql

Name column in seq_region table expanded to varchar(255)

Regulation

Database schema changes

patch_87_88_b.sql - Allow seq_region name to be longer

patch_87_88_c.sql - sample_regulatory_feature_id field for regulatory build

Production

Ensembl 88 mart databases

  • Ensembl Genes 88
    • Region filter performance improvement
    • Renamed some filter/attributes internal and display names
    • Renamed attribute "% GC content" to "Gene % GC content" 
  • Mouse Genes 88
  • Ensembl Variation 88
    • Region filter performance improvement
  • Ensembl Regulation 88
    • Region filter performance improvement
  • Vega 68

Minor change to MySQL dump format

From release 88 onwards, there will be a very minor change to how we generate the .sql files provided as part of our FTP dumps. The SQL will be produced directly as a single file by mysqldump -d (schema only, no row data), which should make the SQL file easier to generate and use. This has been in place for Ensembl Genomes for a number of years without issue.

As an example, you can compare the current format:

ftp://ftp.ensembl.org/pub/release-87/mysql/saccharomyces_cerevisiae_core_87_4/saccharomyces_cerevisiae_core_87_4.sql.gz

with the new format:

ftp://ftp.ensemblgenomes.org/pub/release-34/fungi/mysql/saccharomyces_cerevisiae_core_34_87_4/saccharomyces_cerevisiae_core_34_87_4.sql.gz
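
If you want to check what a downloaded schema file contains, a minimal Python sketch is shown below. It assumes the file is standard mysqldump output, where each table starts with a `CREATE TABLE` line; the function names are hypothetical and the sample SQL is made up for the demonstration rather than taken from either file above.

```python
import gzip
import re

# Matches lines such as: CREATE TABLE `seq_region` (
CREATE_RE = re.compile(r"^CREATE TABLE\s+`?(\w+)`?", re.MULTILINE)


def tables_in_dump(gz_bytes):
    """Return table names found in a gzipped SQL schema, in file order."""
    sql = gzip.decompress(gz_bytes).decode("utf-8", errors="replace")
    return CREATE_RE.findall(sql)


def tables_in_file(path):
    """Same as tables_in_dump, but reads a local .sql.gz file."""
    with open(path, "rb") as handle:
        return tables_in_dump(handle.read())


# Made-up sample in the style of mysqldump -d output.
sample = (
    b"DROP TABLE IF EXISTS `gene`;\n"
    b"CREATE TABLE `gene` (\n  `gene_id` int\n);\n"
    b"CREATE TABLE `seq_region` (\n  `name` varchar(255)\n);\n"
)
print(tables_in_dump(gzip.compress(sample)))
```

Running `tables_in_file` on the old-format and new-format files is one way to confirm that both describe the same set of tables.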
