Mouse assembly and gene annotation

Assembly

This site features the latest major assembly release for mouse. The primary assembly, GRCm38, was released by the Genome Reference Consortium in January 2012. It is based on Mus musculus strain C57BL/6J. This assembly is used by UCSC to create their mm10 database.

The GRCm38 primary assembly comprises 21 chromosomes and 22 unplaced scaffolds. Similar to the human genome assembly, the Genome Reference Consortium will be releasing additional sequence for GRCm38 in the form of minor releases (patches).

To convert your old data from Mouse assembly m37 to m38, click on 'Manage your data' on any mouse page and select 'Assembly converter' from the left-hand menu.

Patches

As the GRC maintains and improves the mouse reference assembly, patches are being introduced. These patches do not change the coordinates of the primary assembly. For more information, please see our Genome Assemblies help document.

The genome assembly represented here corresponds to GenBank Assembly ID GCA_000001635.4.

Gene annotation

The mouse primary assembly GRCm38 was annotated using Ensembl's automatic annotation system. This includes an updated mouse-specific repeat library, RefSeq and UniProt protein sequence data for annotating the coding regions of protein-coding genes, as well as mouse cDNAs and ESTs for annotating the untranslated regions (UTRs) of protein-coding genes.

In the current release, we continue to display a joint gene set based on the merge between the automatic annotation from Ensembl (May 2012) and the manually curated annotation from Havana. Transcripts from the two annotation sources were merged if they shared the same internal exon-intron boundaries (i.e. had an identical splicing pattern), with slight differences in the terminal exons allowed. Importantly, all Vega source transcripts (regardless of merge status) were included in the final merged gene set.
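The merge criterion described above can be illustrated with a minimal sketch. It is not the Ensembl merge code: the function names are hypothetical, exons are assumed to be (start, end) pairs on the same strand, and because the text does not quantify "slight differences", the outer ends of the terminal exons are simply ignored here.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; same strand assumed


def intron_chain(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return the introns (gaps between consecutive exons) of a transcript."""
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def can_merge(a: List[Exon], b: List[Exon]) -> bool:
    """True if both transcripts share the same internal exon-intron boundaries.

    Comparing intron chains ignores the outer ends of the first and last
    exons, which is how this sketch tolerates terminal-exon differences.
    Single-exon transcripts have no introns and are not merged here
    (an assumption of this sketch, not a statement about the real pipeline).
    """
    chain_a, chain_b = intron_chain(a), intron_chain(b)
    return bool(chain_a) and chain_a == chain_b


# Hypothetical coordinates: same splicing pattern, different transcript ends.
automatic = [(100, 200), (300, 400), (500, 650)]
manual = [(80, 200), (300, 400), (500, 700)]
# Hypothetical transcript whose second exon starts at a different position.
shifted = [(100, 200), (310, 400), (500, 650)]

print(can_merge(automatic, manual))
print(can_merge(automatic, shifted))
```

Comparing intron chains rather than exon lists keeps the check focused on the splicing pattern, which is the property the merge rule names.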

In addition to the gene set, we display alignments of mouse cDNA and EST sequences. The mouse cDNA alignments are updated for every Ensembl release. We also display alignments of sequences from UniProt, UniGene and the ENA vertebrate RNA collection, and ab initio gene predictions from Genscan. The Consensus Coding Sequence (CCDS) identifiers have also been mapped to the new assembly. More information about the CCDS project.

Additional manual annotation of this genome can be found in Vega.

HEROIC

Additional functional genomics data produced by the HEROIC project (High-throughput Epigenetic Regulatory Organisation In Chromatin) is available to download from the Ensembl Projects HEROIC portal.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: GRCm38.p2 (Genome Reference Consortium Mouse Reference 38), INSDC Assembly GCA_000001635.4, Jan 2012
Database version: 75.38
Base Pairs: 3,480,955,279
Golden Path Length: 2,730,871,774
Genebuild by: Ensembl
Genebuild method: Full genebuild
Genebuild started: Jan 2012
Genebuild released: Jul 2012
Genebuild last updated/patched: Sep 2013
Gencode version: GENCODE M2

Gene counts (Primary assembly)

Coding genes

Genes and/or transcripts that contain an open reading frame (ORF).

23,148 (incl. 104 readthrough)

Readthrough transcripts are tagged by HAVANA and defined as transcripts connecting two independent loci. A readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs).

Readthrough transcripts are also annotated by RefSeq.
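As an illustration of the readthrough definition, the following sketch checks whether a transcript's exons overlap exons from at least two loci other than its own. HAVANA tags readthrough transcripts as part of manual curation, so this is only a toy restatement of the definition; the function names, locus names and coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; same strand assumed


def exons_overlap(a: Exon, b: Exon) -> bool:
    """True if two closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def looks_like_readthrough(
    own_locus: str,
    exons: List[Exon],
    locus_exons: Dict[str, List[Exon]],
) -> bool:
    """True if the exons overlap exons of two or more loci besides own_locus."""
    hit_loci = {
        locus
        for locus, other_exons in locus_exons.items()
        if locus != own_locus
        and any(exons_overlap(e, o) for e in exons for o in other_exons)
    }
    return len(hit_loci) >= 2


# Hypothetical loci and exon coordinates, for illustration only.
loci = {
    "geneA": [(100, 200), (300, 400)],
    "geneB": [(900, 1000), (1100, 1200)],
    "geneA-geneB": [(100, 200), (1100, 1200)],
}

print(looks_like_readthrough("geneA-geneB", loci["geneA-geneB"], loci))
print(looks_like_readthrough("geneA", loci["geneA"], loci))
```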
Short non coding genes

Short non coding genes are usually fewer than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as short non coding genes: miRNA, miscRNA, rRNA, tRNA, ncRNA, scRNA, snlRNA, snoRNA, snRNA, and also the pseudogenic forms of these biotypes. The majority of the short non coding genes in Ensembl are annotated automatically by our ncRNA pipeline.

5,860
Long non coding genes

Long non coding genes are usually greater than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as long non coding genes: 3prime_overlapping_ncrna, ambiguous_orf, antisense, antisense_RNA, lincRNA, ncrna_host, non_coding, non_stop_decay, processed_transcript, retained_intron, sense_intronic, sense_overlapping. The majority of the long non coding genes in Ensembl are annotated manually by HAVANA.

4,074 (incl. 17 readthrough)
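The biotype groupings listed for short and long non coding genes can be expressed as a simple lookup. This is a sketch for illustration rather than Ensembl's own classification code: the biotype names are copied as spelled in the lists on this page, the function name is hypothetical, and the pseudogenic forms of the short biotypes are not handled because their naming is not given here.

```python
# Biotype names copied from the lists on this page.
SHORT_NONCODING = {
    "miRNA", "miscRNA", "rRNA", "tRNA", "ncRNA",
    "scRNA", "snlRNA", "snoRNA", "snRNA",
}
LONG_NONCODING = {
    "3prime_overlapping_ncrna", "ambiguous_orf", "antisense", "antisense_RNA",
    "lincRNA", "ncrna_host", "non_coding", "non_stop_decay",
    "processed_transcript", "retained_intron", "sense_intronic",
    "sense_overlapping",
}


def classify_biotype(biotype: str) -> str:
    """Map a biotype string to the gene-count category used on this page."""
    if biotype in SHORT_NONCODING:
        return "short non coding"
    if biotype in LONG_NONCODING:
        return "long non coding"
    # Coding genes, pseudogenes and unlisted biotypes fall through here.
    return "other"


print(classify_biotype("snoRNA"))
print(classify_biotype("lincRNA"))
print(classify_biotype("protein_coding"))
```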
Pseudogenes

A pseudogene shares an evolutionary history with a functional protein-coding gene but it has been mutated through evolution to contain frameshift and/or stop codon(s) that disrupt the open reading frame.

5,935 (incl. 3 readthrough)
Gene transcripts

Nucleotide sequence resulting from the transcription of the genomic DNA to mRNA. One gene can have different transcripts or splice variants resulting from the alternative splicing of different exons in genes.

94,647

Gene counts (Alternate sequence)

Coding genes

84
Short non coding genes

62
Long non coding genes

4
Pseudogenes

12
Gene transcripts

282

Other

Genscan gene predictions: 56,884
Short Variants: 75,968,355
Structural variants: 1,850,091

InterPro Hits