Macaque assembly and gene annotation


Mmul_1 is a preliminary assembly of the Indian-origin rhesus monkey, Macaca mulatta, built from whole genome shotgun (WGS) reads from small and medium insert clones. Several WGS libraries with inserts of 2-4 kb and 10 kb, fosmids with ~35 kb inserts, and BACs with 180 kb inserts were used to produce the data.

The release was produced by the Macaque Genome Sequencing Consortium, led by the Baylor College of Medicine, by melding three separate, complementary assemblies (created using the Atlas, Celera and PCAP systems). This involved iteratively splitting likely chimeric scaffolds and joining existing scaffolds where possible. Chimeric scaffolds (<100 in total) were identified by breaks in synteny with the human genome, and were confirmed to be artefacts by comparison with the other assemblies.

This is a draft sequence and may contain errors, so users should exercise caution. Typical errors in draft genome sequences include misassemblies of repeated sequences, collapses of repeated regions, and unmerged overlaps (e.g. due to polymorphisms) that create artificial duplications. However, base accuracy in contigs (contiguous blocks of sequence) is usually very high, with most errors occurring near the ends of contigs.

Gene annotation

The gene set for macaque was built using the Ensembl pipeline. Because the species-specific resources for macaque are relatively limited, we took a combined approach that uses macaque's great similarity to human to aid annotation. The gene structures are mainly based on alignments to human and macaque protein data. Both macaque and human cDNAs were used to add UTR structures, and finally gene predictions based on UniProt proteins and human cDNAs were used to fill gaps in the annotation.

More information

General information about this species can be found in Wikipedia.



Assembly: MMUL 1.0, Feb 2006
Database version: 80.10
Base Pairs: 3,093,871,206
Golden Path Length

The golden path is the length of the reference assembly. It consists of the sum of all top-level sequences in the seq_region table, omitting any redundant regions such as haplotypes and PARs (pseudoautosomal regions).
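The definition above amounts to a filtered sum, which can be sketched in a few lines. This is only an illustration under stated assumptions: the `SeqRegion` class, its `top_level` and `redundant` flags, and the region names and lengths are all hypothetical toy values, not the Ensembl schema, its API, or macaque data.

```python
from dataclasses import dataclass


@dataclass
class SeqRegion:
    """Hypothetical stand-in for one row of a seq_region-like table."""
    name: str
    length: int
    top_level: bool = True    # assumed flag: region is a top-level sequence
    redundant: bool = False   # assumed flag: haplotype or PAR copy


def golden_path_length(regions):
    """Sum the lengths of top-level regions, skipping redundant ones."""
    return sum(r.length for r in regions if r.top_level and not r.redundant)


# Toy data, invented for illustration only.
regions = [
    SeqRegion("chrA", 1000),
    SeqRegion("chrB", 500),
    SeqRegion("chrB_hap1", 200, redundant=True),   # haplotype: excluded
    SeqRegion("contig_1", 300, top_level=False),   # not top-level: excluded
]

print(golden_path_length(regions))  # 1500
```

Only `chrA` and `chrB` contribute to the total here, because the haplotype copy and the non-top-level contig are filtered out before summing.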

Genebuild by: Ensembl
Genebuild method: Full genebuild
Genebuild started: Jan 2006
Genebuild released: Aug 2006
Genebuild last updated/patched: May 2010

Gene counts

Coding genes

Genes and/or transcripts that contain an open reading frame (ORF).

Non coding genes: 6,579
Small non coding genes

Small non coding genes are usually fewer than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as small non coding genes: miRNA, miscRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, and also the pseudogenic forms of these biotypes. The majority of the small non coding genes in Ensembl are annotated automatically by our ncRNA pipeline. Please note that tRNAs are annotated separately using tRNAscan. tRNAs are included as 'simple features', not genes, because they are not annotated using aligned sequence evidence.
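The biotype rule described above can be sketched as a simple set-membership check. This is a minimal illustration, not Ensembl code: the function name is hypothetical, the biotype spellings are copied from the list in the text, and the `_pseudogene` suffix used to mark pseudogenic forms is an assumption about naming.

```python
# Biotypes listed in the text as small non coding genes.
SMALL_NC_BIOTYPES = {
    "miRNA", "miscRNA", "rRNA", "scRNA", "snlRNA", "snoRNA", "snRNA",
}

# Assumed naming convention for the pseudogenic form of a biotype.
PSEUDOGENE_SUFFIX = "_pseudogene"


def is_small_noncoding(biotype):
    """Return True if the biotype, or its pseudogenic form, is in the list."""
    if biotype.endswith(PSEUDOGENE_SUFFIX):
        biotype = biotype[: -len(PSEUDOGENE_SUFFIX)]
    return biotype in SMALL_NC_BIOTYPES


for b in ["miRNA", "snoRNA_pseudogene", "protein_coding", "tRNA"]:
    print(b, is_small_noncoding(b))
```

Note that `tRNA` is deliberately absent from the set, reflecting the statement that tRNAs are annotated separately and are not counted as genes.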

Misc non coding genes: 1,124

A pseudogene shares an evolutionary history with a functional protein-coding gene, but it has been mutated through evolution to contain frameshifts and/or stop codons that disrupt the open reading frame.

Gene transcripts: 44,725

Nucleotide sequence resulting from the transcription of the genomic DNA to mRNA. One gene can have different transcripts or splice variants resulting from the alternative splicing of different exons in genes.


Genscan gene predictions: 125,893
Short Variants: 3,123,522
Structural variants: 123