Assembly
This is the third release of the draft assembly of the Western lowland gorilla (Gorilla gorilla gorilla). The DNA sample came from a 30-year-old female, Kamilah, owned by the San Diego Wild Animal Park. Sequencing and assembly were provided by the Wellcome Trust Sanger Institute.
Sequencing was undertaken using two separate methods: traditional capillary whole-genome shotgun (WGS) sequencing and Solexa new-technology sequencing. Results from the two methods were used in the first, second and third draft gorilla assemblies.
The first draft assembly (gorGor1) was released in September 2008. This initial draft was a 2.1x coverage assembly created from WGS capillary reads using the Phusion assembler. Sequencing errors in the capillary reads were corrected by taking the consensus of Solexa data aligned to the assembly.
To create the second draft assembly, Solexa data, sequenced at roughly 35x, was assembled into contigs using Abyss. The resulting contigs of length 50 bp or longer were then assembled along with the WGS capillary data using the Phusion assembler. Next, the Solexa read pairs were aligned to the human reference genome using Maq to identify syntenic regions and breakpoints between human and gorilla. Using human-gorilla synteny as a guide, longer gorilla supercontigs were constructed using Velvet and other assembly tools.
In the third draft assembly (gorGor3) for the current release, gorilla supercontigs which could be ordered with respect to the human reference genome were assembled into simulated chromosomes, while incorporating the chromosome 2 split (as in chimpanzee) and the reciprocal translocation between chromosomes 5 and 17.
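The ordering step described above can be illustrated with a minimal sketch. It is not the pipeline used for gorGor3: the placement tuples, the supercontig names, the spacer length and the function names are all hypothetical, and the sketch only shows the general idea of grouping supercontigs by the reference chromosome they map to, sorting them by reference position, and joining them with gaps. Handling of the chromosome 2 split or the 5/17 translocation is assumed to be encoded upstream in the chromosome labels.

```python
from collections import defaultdict

GAP = "N" * 10  # hypothetical spacer placed between adjacent supercontigs

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_simulated_chromosomes(placements, sequences):
    """Join supercontigs into per-chromosome strings.

    placements: iterable of (supercontig_id, chromosome, ref_start, strand),
                a hypothetical summary of where each supercontig aligns.
    sequences:  dict mapping supercontig_id to its DNA sequence.
    """
    by_chrom = defaultdict(list)
    for sc_id, chrom, ref_start, strand in placements:
        by_chrom[chrom].append((ref_start, sc_id, strand))

    chromosomes = {}
    for chrom, items in by_chrom.items():
        parts = []
        # Sort by reference start so supercontigs follow the reference order.
        for _, sc_id, strand in sorted(items):
            seq = sequences[sc_id]
            parts.append(reverse_complement(seq) if strand == "-" else seq)
        chromosomes[chrom] = GAP.join(parts)
    return chromosomes


# Toy data, invented for illustration only.
toy_sequences = {"sc1": "AAC", "sc2": "GGG", "sc3": "TTT"}
toy_placements = [
    ("sc2", "5", 2000, "+"),
    ("sc1", "5", 100, "-"),
    ("sc3", "2a", 50, "+"),
]
toy_chromosomes = build_simulated_chromosomes(toy_placements, toy_sequences)
```

Supercontigs with no usable placement would simply be left out of `placements` and remain unplaced, which matches the statement that only supercontigs that could be ordered were assembled into chromosomes.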
The total length of the gorGor3 assembly is 3.04 Gb. The N50 size for contigs is 11,657 bp and the N50 size for supercontigs is 913,458 bp.
The genome assembly represented here corresponds to GenBank Assembly ID GCA_000151905.1.
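For readers unfamiliar with the statistic, N50 is the length L such that sequences of length L or longer together contain at least half of the total assembled bases. The short sketch below computes it from a list of lengths; the function name and the toy lengths are illustrative and are not taken from the gorGor3 data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (total 300 bp): 80 + 70 reaches half the total.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
```

Applied to the real contig and supercontig length lists, the same calculation would give the two N50 values quoted above.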
Gene annotation in gorilla has been generated by projection of genes from the human reference genome as well as alignment of proteins from three major sources (in descending order of their contribution to the final gene set):
- Ensembl human translations from Ensembl release 56;
- UniProt mammalian and vertebrate proteins with evidence at either the protein or transcript level for their existence; and
- Gorilla gorilla proteins obtained from UniProtKB.
Projection of human genes to gorilla began with the alignment of the gorilla genome to the latest human reference genome (GRCh37 assembly) using BLASTz. These alignments were used to project human Ensembl gene structures (Ensembl version 56) to the corresponding location in gorilla. About 60% of human protein-coding genes were projected onto the gorilla genome. Small insertions/deletions that disrupt the reading frame of the resultant projected transcripts are corrected by inserting "frame-shift" introns into the structure. For some human exons and parts of exons, the corresponding gorilla sequence is missing from the assembly. In most of these cases, the missing exon is omitted from the gorilla gene model. In a small number of cases, however, where BLASTz has aligned the human sequence to a gap in the gorilla sequence, the exon is placed in the gap, resulting in a run of X's of the correct length in the translation.
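The "run of X's" behaviour can be illustrated with a small sketch. It is not the Ensembl projection code: it simply assumes that an exon placed in an assembly gap contributes N bases to the coding sequence, and that any codon containing an N translates to X. The function name and the toy exons are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO_ACIDS)
)


def translate(cds):
    """Translate a coding sequence; codons not in the table (e.g. NNN) become X."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Toy transcript: the middle exon (9 bp) falls in an assembly gap.
toy_exons = ["ATGGCC", "N" * 9, "AAATAA"]
toy_protein = translate("".join(toy_exons))
```

Because the gap exon keeps its original length, the downstream exon stays in frame and the translation carries three X's in place of the three missing residues.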
Ensembl human translations were also aligned to the gorilla genome using Exonerate. The alignment of mammalian/vertebrate proteins and gorilla-specific proteins followed procedures in the standard Ensembl genebuild pipeline using Genewise.
The gene-building procedure on the gorGor3 assembly identified 20,803 protein-coding genes and 1,553 pseudogenes.
Additional manual annotation of this genome can be found in Vega.
General information about this species can be found in Wikipedia.
|Assembly:||gorGor3.1, Dec 2009|
|Golden Path Length:||3,040,677,044|
|Genebuild method:||Full genebuild|
|Genebuild started:||Aug 2009|
|Genebuild released:||Mar 2010|
|Genebuild last updated/patched:||Jul 2011|
|Protein-coding genes: genes and/or transcripts that contain an open reading frame (ORF).|
|Short non-coding genes: usually fewer than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as short non-coding genes: miRNA, miscRNA, rRNA, tRNA, ncRNA, scRNA, snlRNA, snoRNA and snRNA, and also the pseudogenic forms of these biotypes. The majority of the short non-coding genes in Ensembl are annotated automatically by our ncRNA pipeline.|
|Pseudogenes: a pseudogene shares an evolutionary history with a functional protein-coding gene, but it has been mutated through evolution to contain frameshifts and/or stop codons that disrupt the open reading frame.|
|Gene transcripts (nucleotide sequences resulting from the transcription of genomic DNA to mRNA; one gene can have different transcripts, or splice variants, resulting from the alternative splicing of its exons):||35,727|
|Genscan gene predictions:||50,831|