Chimpanzee assembly and gene annotation

Assembly

This site displays version 2.1.4 (February 2011) of the chimpanzee genome assembly (known as Pan_troglodytes-2.1.4 or CHIMP2.1.4). The whole genome shotgun sequence data were assembled and organized by the Washington University Genome Center. The underlying whole genome shotgun data were generated at the Washington University School of Medicine and the Broad Institute.

This assembly, version 2.1.4, has an updated chromosome Y compared to version 2.1.3. Assembly 2.1.3 improved on the 2.1 chimp assembly by adding over 300,000 finishing reads and merging in 640 finished BACs. Approximately 49,000 additional merges were made in that assembly compared to the 2.1 assembly.

This assembly covers about 97 percent of the genome and is based on 6X sequence coverage. It is composed of 185,384 contigs with an N50 length of 53kb, and 33,990 supercontigs with an N50 length of 9Mb.
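
The N50 figures quoted above follow the usual definition: sort the sequences from longest to shortest and report the length at which the running total first reaches half of the summed length. The following is a minimal sketch of that calculation; the function name `n50` and the toy lengths are illustrative assumptions, not values from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together account for at least half of the total length.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths (not real chimpanzee data).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

The same function applies to contigs and supercontigs alike; only the input list of lengths changes.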

The whole genome shotgun data from primary donor-derived reads (Clint, a captive-born male chimpanzee from the Yerkes Primate Research Center, Atlanta, USA) were assembled with PCAP (Huang 2006) using stringent parameters derived by eliminating detectable global mis-assemblies (interchromosomal cross-overs determined by alignment of the chimpanzee genome against the human genome) larger than 50kb.

The genome assembly represented here corresponds to GenBank Assembly ID GCA_000001515.4.

Gene annotation

The genome was aligned to human GRCh37 using BLASTZ in an eHive pipeline. These alignments were used to transfer human Ensembl gene structures (Human Build 63) to chimpanzee. 97.7% of the chimp-specific proteins were aligned to the chimp genome in a first layer of annotation. The missing proteins mainly contain multiple internal stop codons in the assembled genome.

A total of 1,371 chimp-specific protein sequences were used during the gene build process and were aligned using a combination of Genewise and Exonerate. Owing to the small number of proteins (some of which aligned in the same location), an additional layer of gene structures was added by projection of human genes. The high-quality annotation of the human genome and the high degree of similarity between the human and chimpanzee genomes enable us to identify genes in chimpanzee by transferring human genes to the corresponding location in chimp.

The protein-coding transcripts of the human gene structures are projected through the Whole Genome Alignment (WGA) onto the chromosomes in the chimp genome. Small insertions/deletions that disrupt the reading-frame of the resultant transcripts are corrected for by inserting "frame-shift" introns into the structure.
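
The idea of a "frame-shift" intron can be illustrated with a small sketch. It assumes the simplest case only: extra bases present in the chimp sequence but absent from the human source, with a length that is not a multiple of three. The function name `add_frameshift_introns`, the coordinate convention (1-based, inclusive) and the input format are hypothetical, and the sketch does not cover deletions in the chimp sequence or any other detail of the actual projection code.

```python
def add_frameshift_introns(exon_start, exon_end, insertions):
    """Split a projected exon around frame-disrupting insertions.

    exon_start, exon_end: 1-based inclusive target coordinates.
    insertions: list of (start, length) pairs giving runs of target
        bases that have no counterpart in the source transcript.
    Returns a list of (start, end) sub-exons; the gaps between them
    act as short "frame-shift" introns.
    """
    pieces = []
    cursor = exon_start
    for ins_start, ins_len in sorted(insertions):
        if ins_len % 3 == 0:
            # An in-frame insertion does not disrupt the reading frame.
            continue
        ins_end = ins_start + ins_len - 1
        if ins_start <= cursor or ins_end >= exon_end:
            raise ValueError("insertion must lie strictly inside the exon")
        pieces.append((cursor, ins_start - 1))
        cursor = ins_end + 1
    pieces.append((cursor, exon_end))
    return pieces


# Hypothetical exon with a single extra base at position 150.
print(add_frameshift_introns(100, 199, [(150, 1)]))
```

In this toy case the exon is split into two pieces on either side of the extra base, so the spliced sequence keeps the reading frame of the source transcript.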

For some human exons and parts of exons, the corresponding chimp sequence is missing from the assembly. In most of these cases, the missing exon is omitted from the chimpanzee gene model. In a small number of cases, however, where BLASTZ has aligned the human sequence to a gap in the chimp sequence, the exon is placed in the gap, resulting in a run of X's of the correct length in the translation. Some human transcripts fail to transfer cleanly (due to, for example, missing alignment in the orthologous regions). We have attempted to recover these using Exonerate. The single best Exonerate alignment to chimp is chosen for each "missing" human transcript, and transcripts with less than 90% identity to the source or less than 60% coverage of the source are discarded.
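
The recovery filter described above can be sketched as follows. The 90% identity and 60% coverage thresholds come from the text; everything else is an assumption made for illustration, including the `Alignment` record, its field names, and the use of the highest alignment score to decide which alignment is "best" (the text does not say how the best alignment is ranked).

```python
from dataclasses import dataclass

MIN_IDENTITY = 90.0  # percent identity to the source transcript (from the text)
MIN_COVERAGE = 60.0  # percent coverage of the source transcript (from the text)


@dataclass
class Alignment:
    # Hypothetical record; field names are illustrative only.
    transcript_id: str
    identity: float  # percent
    coverage: float  # percent
    score: float     # assumed ranking criterion


def select_recovered(alignments):
    """Keep the single best alignment per transcript, then apply thresholds."""
    best = {}
    for aln in alignments:
        current = best.get(aln.transcript_id)
        if current is None or aln.score > current.score:
            best[aln.transcript_id] = aln
    return [
        aln
        for aln in best.values()
        if aln.identity >= MIN_IDENTITY and aln.coverage >= MIN_COVERAGE
    ]


# Hypothetical alignments for three "missing" human transcripts.
candidates = [
    Alignment("T1", identity=95.0, coverage=80.0, score=500.0),
    Alignment("T1", identity=99.0, coverage=40.0, score=300.0),
    Alignment("T2", identity=85.0, coverage=90.0, score=450.0),
    Alignment("T3", identity=92.0, coverage=55.0, score=410.0),
]
print([aln.transcript_id for aln in select_recovered(candidates)])
```

Note that the thresholds are applied only to the best alignment of each transcript, so a transcript is dropped entirely if that alignment fails either test.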

The final data set includes 18,746 protein-coding genes.

RNA-seq data were provided by Henrik Kaessmann from the University of Lausanne after gene annotation was complete. These data were aligned to the genome using BWA. We used our in-house RNA-seq pipeline to build gene models based on these alignments. Both the original BAM files and the additional gene models can be viewed in LocationView. The RNA-seq based gene models have not yet been added to the default chimpanzee gene set.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: CHIMP2.1.4, INSDC Assembly GCA_000001515.4, Feb 2011
Database version: 75.214
Base Pairs: 2,995,917,117
Golden Path Length: 3,309,577,922
Genebuild by: Ensembl
Genebuild method: Full genebuild
Genebuild started: May 2011
Genebuild released: Dec 2011
Genebuild last updated/patched: Nov 2012

Gene counts

Coding genes: 18,759

Genes and/or transcripts that contain an open reading frame (ORF).

Short non coding genes: 8,681

Short non coding genes are usually fewer than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as short non coding genes: miRNA, miscRNA, rRNA, tRNA, ncRNA, scRNA, snlRNA, snoRNA, snRNA, and also the pseudogenic forms of these biotypes. The majority of the short non coding genes in Ensembl are annotated automatically by our ncRNA pipeline.

Pseudogenes: 572

A pseudogene shares an evolutionary history with a functional protein-coding gene, but it has been mutated through evolution to contain frameshifts and/or stop codons that disrupt the open reading frame.

Gene transcripts: 29,160

Nucleotide sequence resulting from the transcription of the genomic DNA to mRNA. One gene can have different transcripts or splice variants resulting from the alternative splicing of different exons.

Other

Genscan gene predictions: 56,687
Short Variants: 1,729,238

InterPro Hits