Mouse Lemur assembly and gene annotation

Assembly

The Mmur_2.0 assembly was submitted by the Baylor College of Medicine in 2015. The assembly is at the scaffold level and consists of 56,281 contigs assembled into 10,311 scaffolds.

The N50 is the length such that 50% of the assembled genome lies in blocks of that length or longer. The contig N50 is 182,929 bp and the scaffold N50 is 3,711,085 bp. This is a major improvement over Mmur_1.0, where the corresponding figures were 3.51 kb for contigs and 107.02 kb for supercontigs. The improvement in contig and scaffold N50 has allowed a much more complete annotation of the mouse lemur genome.
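The N50 definition above can be made concrete with a short calculation. The following is a minimal sketch in Python, not the code used to produce the figures on this page; the function name `n50` and the toy lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    The N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total assembled bases.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("at least one sequence length is required")

    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases until
    # half of the assembly has been covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, not mouse lemur data).
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_lengths)
```

In the toy example the total is 32, and the two longest sequences (8 + 8 = 16) already cover half of it, so the N50 is 8. Applied to the real contig and scaffold lengths, the same calculation would be expected to give the figures quoted above.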

The genome assembly represented here corresponds to GenBank Assembly ID GCA_000165445.2.

Gene annotation

The annotation of the protein-coding genes was carried out using three main techniques:

  • Mouse lemur RNA-seq data from nine tissues.
  • Whole genome alignment against the human GRCh38 assembly followed by projection of GENCODE Basic protein-coding transcripts in regions of sufficient alignment conservation.
  • Splice-aware alignment of a subset of UniProt proteins to the Mmur_2.0 assembly.

The subset of UniProt proteins used was our 'primates basic' set. This consisted of the proteins from the following clades and protein existence (PE) levels:

  • Human PE level 1 & 2 proteins.
  • Other primates PE level 1, 2 & 3 proteins.
  • Mouse PE level 1 & 2 proteins.
  • Other mammals PE level 1 & 2 proteins.
  • Other vertebrates PE level 1 & 2 proteins.
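The 'primates basic' selection rule listed above amounts to a lookup from clade to permitted protein existence levels. The sketch below illustrates that rule only; the clade keys, the `PRIMATES_BASIC` table name, and the `in_primates_basic` function are hypothetical and are not part of the Ensembl genebuild code.

```python
# Permitted UniProt protein existence (PE) levels per clade, as listed above.
# The clade labels are illustrative assumptions.
PRIMATES_BASIC = {
    "human": {1, 2},
    "other_primates": {1, 2, 3},
    "mouse": {1, 2},
    "other_mammals": {1, 2},
    "other_vertebrates": {1, 2},
}


def in_primates_basic(clade, pe_level):
    """Return True if a protein from `clade` with `pe_level` is in the subset."""
    return pe_level in PRIMATES_BASIC.get(clade, set())


# Hypothetical records: (accession placeholder, clade, PE level).
records = [
    ("protein_a", "human", 1),
    ("protein_b", "human", 3),
    ("protein_c", "other_primates", 3),
    ("protein_d", "other_mammals", 4),
]
selected = [name for name, clade, pe in records if in_primates_basic(clade, pe)]
```

Only the non-human primates admit PE level 3 proteins, so `protein_c` is kept while `protein_b` is dropped.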

UTRs were obtained (where possible) from the RNA-seq data and alignments of RefSeq mouse lemur cDNAs.

Small ncRNAs were obtained using a combination of BLAST and Infernal/RNAfold.

Pseudogenes were identified by looking for genes that had a large percentage of non-biological introns (introns of <10 bp), that were covered in repeats, or that were single-exon with evidence of a functional multi-exon paralog elsewhere in the genome.
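The three pseudogene criteria can be read as a simple decision rule. The sketch below is one possible reading, not the Ensembl implementation: the `GeneSummary` type and the function name are hypothetical, and the two thresholds are placeholder assumptions, since the text gives no numeric cut-off for "a large percentage" or "covered in repeats". Only the 10 bp intron limit comes from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneSummary:
    """Hypothetical per-gene summary used only for this illustration."""
    intron_lengths: List[int]      # empty list means a single-exon gene
    repeat_fraction: float         # fraction of the gene overlapped by repeats
    has_multi_exon_paralog: bool   # functional multi-exon paralog found elsewhere


def looks_like_pseudogene(gene,
                          short_intron_fraction=0.5,   # assumed threshold
                          repeat_threshold=0.8):       # assumed threshold
    """Apply the three criteria described in the text to one gene."""
    introns = gene.intron_lengths

    # 1. A large share of non-biological introns (shorter than 10 bp).
    if introns:
        short = sum(1 for length in introns if length < 10)
        if short / len(introns) > short_intron_fraction:
            return True

    # 2. The gene is largely covered by repeats.
    if gene.repeat_fraction > repeat_threshold:
        return True

    # 3. Single-exon gene with a functional multi-exon paralog elsewhere.
    if not introns and gene.has_multi_exon_paralog:
        return True

    return False


retrocopy = GeneSummary(intron_lengths=[], repeat_fraction=0.1,
                        has_multi_exon_paralog=True)
normal_gene = GeneSummary(intron_lengths=[850, 1200, 430], repeat_fraction=0.05,
                          has_multi_exon_paralog=False)
```

Here `retrocopy` would be flagged under the third criterion, while `normal_gene` passes all three checks.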

The annotation process is described in the document below.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: Mmur_2.0, INSDC Assembly GCA_000165445.2, May 2015
Base Pairs: 2,377,792,829
Golden Path Length: 2,438,821,538
Annotation provider: Ensembl
Annotation method: Full genebuild
Genebuild started: Jun 2016
Genebuild released: Oct 2016
Genebuild last updated/patched: Apr 2017
Database version: 90.20

Gene counts

Coding genes: 18,103
Non coding genes: 10,223
Small non coding genes: 4,788
Long non coding genes: 2,636
Misc non coding genes: 2,799
Pseudogenes: 208
Gene transcripts: 46,339
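
The small, long, and miscellaneous non-coding counts appear to be subcategories of the non-coding total. That reading is an inference from the numbers rather than something stated on this page, and the short check below confirms that the three figures sum to the reported total. The variable names are illustrative.

```python
# Gene counts as reported in the statistics above.
small_non_coding = 4_788
long_non_coding = 2_636
misc_non_coding = 2_799
non_coding_total = 10_223

# Sum of the three subcategories, compared against the reported total.
subtotal = small_non_coding + long_non_coding + misc_non_coding
counts_consistent = subtotal == non_coding_total
```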

Other

Genscan gene predictions: 55,168

About this species