Macaque assembly and gene annotation

Assembly

The Mmul_10 assembly was submitted by The Genome Institute at Washington University School of Medicine in March 2019. The assembly is at chromosome level, consisting of 3,182 contigs assembled into 2,979 scaffolds. From these sequences, 22 chromosomes have been built. The N50 size is the length such that 50% of the assembled genome lies in blocks of the N50 size or longer. The contig N50 is 46,608,966 bp, while the scaffold N50 is 82,346,004 bp.
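The N50 definition above can be illustrated with a minimal sketch. The function name n50 and the toy contig lengths are hypothetical and are not taken from the Mmul_10 assembly; the code only shows how the statistic is derived from a list of sequence lengths.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating length until
    # at least half of the total assembled length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy lengths, not real macaque contigs.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))
```

With these toy values the total is 290, so the running sum first reaches half of the total at the second-longest sequence, and the N50 is 70.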

Other assemblies

MMUL_1 (Ensembl release 80)

Gene annotation

The annotation for the protein-coding genes was carried out using three main techniques:

Rhesus macaque RNA-seq data from thirteen tissues.
Whole genome alignment against the human GRCh38 assembly followed by projection of GENCODE Basic protein-coding transcripts in regions of sufficient alignment conservation.
Splice-aware alignment of a subset of UniProt proteins to the Mmul_8.0.1 assembly.

The subset of UniProt proteins used was our 'primates basic' set. This consisted of the proteins from the following clades and protein existence (PE) levels:

Human PE level 1 & 2 proteins.
Other primates PE level 1, 2 & 3 proteins.
Mouse PE level 1 & 2 proteins.
Other mammals PE level 1 & 2 proteins.
Other vertebrates PE level 1 & 2 proteins.
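The selection rule for the 'primates basic' set can be sketched as a simple filter on clade and PE level. This is an illustration only: the Protein class, the clade labels, and the accessions are hypothetical, and the actual genebuild pipeline may select proteins differently.

```python
from dataclasses import dataclass

# Highest protein existence (PE) level accepted per clade, as listed above.
# The clade labels used as keys are hypothetical names for this sketch.
MAX_PE_LEVEL = {
    "human": 2,
    "other_primates": 3,
    "mouse": 2,
    "other_mammals": 2,
    "other_vertebrates": 2,
}


@dataclass(frozen=True)
class Protein:
    accession: str  # placeholder identifier, not a real UniProt accession
    clade: str
    pe_level: int


def in_primates_basic(protein):
    """Return True if the protein passes the clade/PE-level rule."""
    max_level = MAX_PE_LEVEL.get(protein.clade)
    return max_level is not None and 1 <= protein.pe_level <= max_level


candidates = [
    Protein("EXAMPLE1", "human", 1),
    Protein("EXAMPLE2", "human", 3),
    Protein("EXAMPLE3", "other_primates", 3),
    Protein("EXAMPLE4", "other_vertebrates", 4),
]
selected = [p.accession for p in candidates if in_primates_basic(p)]
print(selected)
```

Only the primate clade admits PE level 3 here, which mirrors the list above: evidence inferred from homology is accepted for close relatives but not for more distant clades.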

UTRs were obtained (where possible) from the RNA-seq data and alignments of RefSeq rhesus macaque cDNAs.

Small ncRNAs were obtained using a combination of BLAST and Infernal/RNAfold.

Long intergenic ncRNAs were annotated from transcripts produced during the RNA-seq pipeline that had either a poor or no BLAST hit to any UniProt vertebrate PE12 protein. These transcripts were then scanned for evidence of protein domains. If no evidence was found, the model was marked as lincRNA.
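The lincRNA decision described above amounts to a two-step filter, sketched below. The function name and the 'good' / 'poor' / 'none' labels are assumptions made for illustration; the text does not state the thresholds that separate a poor BLAST hit from a good one, so none are encoded here.

```python
VALID_HITS = {"good", "poor", "none"}


def is_lincrna(blast_hit, has_protein_domain):
    """Apply the two-step rule to one RNA-seq transcript model.

    blast_hit: hypothetical label summarising the best BLAST hit against
        UniProt vertebrate PE level 1 and 2 proteins.
    has_protein_domain: True if the domain scan found any evidence.
    """
    if blast_hit not in VALID_HITS:
        raise ValueError(f"unknown BLAST hit label: {blast_hit!r}")
    # Step 1: only models with a poor hit or no hit are candidates.
    is_candidate = blast_hit in {"poor", "none"}
    # Step 2: a candidate with no protein domain evidence is a lincRNA.
    return is_candidate and not has_protein_domain


for hit, domain in [("none", False), ("poor", True), ("good", False)]:
    print(hit, domain, is_lincrna(hit, domain))
```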

The annotation process is described in the document below.

Detailed information on macaque genebuild (PDF)

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly	Mmul_10, INSDC Assembly GCA_003339765.3, Feb 2019
Base Pairs	2,971,331,530
Golden Path Length	2,971,331,530
Annotation provider	Ensembl
Annotation method	Full genebuild
Genebuild started	Apr 2019
Genebuild released	Sep 2019
Genebuild last updated/patched	Dec 2019
Database version	115.10

Gene counts

Coding genes	21,761
Non coding genes	12,904
Small non coding genes	4,712
Long non coding genes	4,773
Misc non coding genes	3,419
Pseudogenes	767
Gene transcripts	64,228

Coding gene: a gene/transcript that contains an open reading frame (ORF).
Pseudogene: a gene that has homology to known protein-coding genes but contains a frameshift and/or stop codon(s) that disrupt the ORF. Thought to have arisen through duplication followed by loss of function.
Transcript: the operational unit of a gene. In a genomic context, transcripts consist of one or more exons, with adjoining exons being separated by introns. The exons/introns are transcribed and then the introns spliced out. Transcripts may or may not encode a protein.
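The gene counts above can be cross-checked with a little arithmetic. The sketch below confirms that the three non-coding subcategories add up to the non-coding total, and computes an average number of transcripts per gene. The average rests on an assumption that is not stated on this page: that the transcript count covers coding genes, non-coding genes, and pseudogenes together.

```python
# Counts copied from the gene counts table.
coding_genes = 21_761
non_coding_genes = 12_904
small_non_coding = 4_712
long_non_coding = 4_773
misc_non_coding = 3_419
pseudogenes = 767
transcripts = 64_228

# The three non-coding subcategories should sum to the non-coding total.
subcategory_sum = small_non_coding + long_non_coding + misc_non_coding
print(subcategory_sum == non_coding_genes)

# Assumption: every gene category contributes to the transcript count.
all_genes = coding_genes + non_coding_genes + pseudogenes
transcripts_per_gene = transcripts / all_genes
print(all_genes, round(transcripts_per_gene, 2))
```

Under that assumption the page lists 35,432 genes in total, or roughly 1.81 transcripts per gene.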

Other

Genscan gene predictions	47,724
Short Variants	51,642,961
Structural variants	110
