Gorilla assembly and gene annotation

Assembly

This is the third release of the draft assembly of the Western lowland gorilla (Gorilla gorilla gorilla). The DNA sample came from a 30-year-old female, Kamilah, owned by the San Diego Wild Animal Park; sequencing and assembly were provided by the Wellcome Trust Sanger Institute.

Sequencing was undertaken using two separate methods: traditional capillary whole-genome shotgun (WGS) sequencing and Solexa new-technology sequencing. Results from the two methods were used in the first, second and third draft gorilla assemblies.

The first draft assembly (gorGor1), released in September 2008, had 2.1x coverage. It was created from WGS capillary reads using the Phusion assembler; sequencing errors in the capillary reads were corrected by taking the consensus of the Solexa data aligned to the assembly.

To create the second draft assembly, Solexa data sequenced at roughly 35x coverage were assembled into contigs using ABySS. The resulting contigs of length 50 bp or longer were then assembled along with the WGS capillary data using the Phusion assembler. Next, the Solexa read pairs were aligned to the human reference genome using Maq to identify syntenic regions and breakpoints between human and gorilla. Using human-gorilla synteny as a guide, longer gorilla supercontigs were constructed using Velvet and other assembly tools.

In the third draft assembly (gorGor3) for the current release, gorilla supercontigs which could be ordered with respect to the human reference genome were assembled into simulated chromosomes, while incorporating the chromosome 2 split (as in chimpanzee) and the reciprocal translocation between chromosomes 5 and 17.

The total length of the gorGor3 assembly is 3.04Gb. The N50 size for contigs is 11657 bp and the N50 size for supercontigs is 913458 bp.
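The N50 statistic quoted above can be illustrated with a minimal Python sketch. It assumes the common definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total length); the function name n50 and the toy contig lengths are hypothetical and are not taken from the gorGor3 data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and their lengths are
    accumulated; the N50 is the length at which the running total first
    reaches half of the summed length of all sequences.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


# Hypothetical toy contig lengths in bp (not real gorilla contigs).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # prints 70
```

In the toy example the total is 300 bp, so the two longest contigs (80 + 70 = 150 bp) already cover half of it, and the N50 is 70 bp.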

The genome assembly represented here corresponds to GenBank Assembly ID GCA_000151905.1.

Gene annotation

Gene annotation in gorilla has been generated by projection of genes from the human reference genome as well as alignment of proteins from three major sources (in descending order of their contribution to the final gene set):

  1. Ensembl human translations from Ensembl release 56;
  2. UniProt mammalian and vertebrate proteins with evidence at either the protein or transcript level for their existence; and
  3. Gorilla gorilla proteins obtained from UniProtKB.

Projection of human genes to gorilla began with the alignment of the gorilla genome to the latest human reference genome (GRCh37 assembly) using BLASTz. These alignments were used to project human Ensembl gene structures (Ensembl version 56) to the corresponding location in gorilla. About 60% of human protein-coding genes were projected onto the gorilla genome. Small insertions/deletions that disrupt the reading frame of the resultant projected transcripts were corrected by inserting "frame-shift" introns into the structure. For some human exons and parts of exons, the corresponding gorilla sequence is missing from the assembly. In most of these cases, the missing exon was omitted from the gorilla gene model. In a small number of cases, however, where BLASTz aligned the human sequence to a gap in the gorilla sequence, the exon was placed in the gap, resulting in a run of X's of the correct length in the translation.
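The run of X's described above can be illustrated with a short Python sketch. It assumes that gap bases are represented as N and that any codon containing an N is reported as X; the function name translate_with_gaps and the toy sequence are hypothetical and are not part of the Ensembl pipeline.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_with_gaps(cds):
    """Translate a coding sequence; codons not in the table (e.g. with N) become X."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Hypothetical transcript in which nine bases of an exon fall in a gap.
toy_cds = "ATGGCC" + "N" * 9 + "GGCTGA"
print(translate_with_gaps(toy_cds))  # prints MAXXXG*
```

Because the gap is nine bases long, it yields exactly three X residues, so the translation keeps the correct length and reading frame on either side of the gap.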

Ensembl human translations were also aligned to the gorilla genome using Exonerate. The alignment of mammalian/vertebrate proteins and gorilla-specific proteins followed procedures in the standard Ensembl genebuild pipeline using Genewise.

The gene-building procedure on the gorGor3 assembly identified 20803 protein-coding genes and 1553 pseudogenes.

Additional manual annotation of this genome can be found in Vega.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: gorGor3.1, INSDC Assembly GCA_000151905.1, Dec 2009
Database version: 78.31
Base Pairs: 2,828,888,833
Golden Path Length

The golden path is the length of the reference assembly. It consists of the sum of all top-level sequences in the seq_region table, omitting any redundant regions such as haplotypes and PARs (pseudoautosomal regions).

3,040,677,044
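The golden path definition above amounts to a filtered sum, sketched below in Python. The SeqRegion class, its field names, and the toy regions are hypothetical stand-ins for illustration; they are assumptions and do not reproduce the actual Ensembl seq_region schema.

```python
from dataclasses import dataclass


@dataclass
class SeqRegion:
    """Hypothetical, simplified stand-in for one sequence region record."""
    name: str
    length: int
    top_level: bool
    redundant: bool = False  # e.g. a haplotype or a PAR copy


def golden_path_length(regions):
    """Sum the lengths of top-level regions, skipping redundant ones."""
    return sum(r.length for r in regions if r.top_level and not r.redundant)


# Toy data only; names and lengths are invented for illustration.
toy_regions = [
    SeqRegion("1", 1000, top_level=True),
    SeqRegion("2a", 500, top_level=True),
    SeqRegion("HAP1", 200, top_level=True, redundant=True),
    SeqRegion("contig_x", 300, top_level=False),
]
print(golden_path_length(toy_regions))  # prints 1500
```

Only the two non-redundant top-level regions contribute to the total; the haplotype copy and the lower-level contig are left out.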
Genebuild by: Ensembl
Genebuild method: Full genebuild
Genebuild started: Aug 2009
Genebuild released: Mar 2010
Genebuild last updated/patched: Jul 2011

Gene counts

Coding genes

Genes and/or transcripts that contain an open reading frame (ORF).

20,962
Small non coding genes

Small non coding genes are usually fewer than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as small non coding genes: miRNA, miscRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, and also the pseudogenic form of these biotypes. The majority of the small non coding genes in Ensembl are annotated automatically by our ncRNA pipeline. Please note that tRNAs are annotated separately using tRNAscan. tRNAs are included as 'simple features', not genes, because they are not annotated using aligned sequence evidence.

6,701
Pseudogenes

A pseudogene shares an evolutionary history with a functional protein-coding gene but it has been mutated through evolution to contain frameshift and/or stop codon(s) that disrupt the open reading frame.

1,553
Gene transcripts

Nucleotide sequence resulting from the transcription of the genomic DNA to mRNA. One gene can have different transcripts or splice variants resulting from the alternative splicing of different exons in genes.

35,727

Other

Genscan gene predictions: 50,831