Zebrafish assembly and gene annotation

Assembly

The zebrafish genome project is a collaboration between the Sanger Institute and the zebrafish community. It was announced during the Sanger Institute Zebrafish Workshop 2000 and started in February 2001.

Zv9 (GCA_000002035.2) is the ninth integrated assembly of the zebrafish genome. This assembly is used by UCSC to create their danRer7 database. It is based on nearly 90% clone sequence (data freeze April 2010), with the remaining gaps filled using sequence from a novel whole genome shotgun assembly, WGS31. Project coordination, genome sequencing and assembly are provided by the Wellcome Trust Sanger Institute.

An overview of the assembly is available here, and frequently asked questions about the assembly process and terminology are addressed here.

The genome assembly represented here corresponds to GenBank Assembly ID GCA_000002035.2.

Gene annotation

The zebrafish Zv9 assembly was annotated using a modified Ensembl pipeline. Predictions from zebrafish proteins have been given priority over predictions from other non-mammalian vertebrate species. All UniProt proteins were filtered to remove predictions (PE level 3 and above). Aligned zebrafish cDNAs have been used to add UTR regions. 8,374 RNASeq models made from a range of zebrafish developmental stages and tissues were added into the gene build where they added a novel model or splice variant. Genes are named based on the alignment of their coding regions to known entries in public databases; ZFIN genes have priority in this process.
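The protein filtering step above can be illustrated with a minimal sketch. It assumes each protein record carries its UniProt protein existence (PE) level as an integer, where levels 1 and 2 indicate evidence at the protein or transcript level and levels 3 to 5 are inferred, predicted or uncertain. The record layout, the accession strings and the function name `keep_supported` are hypothetical and are not part of the Ensembl pipeline.

```python
from typing import Dict, List

# Hypothetical records: placeholder accessions with a UniProt PE level (1-5).
proteins: List[Dict] = [
    {"accession": "PROT_A", "pe": 1},  # evidence at protein level
    {"accession": "PROT_B", "pe": 2},  # evidence at transcript level
    {"accession": "PROT_C", "pe": 3},  # inferred from homology
    {"accession": "PROT_D", "pe": 4},  # predicted
]


def keep_supported(entries: List[Dict], max_pe: int = 2) -> List[Dict]:
    """Drop entries at PE level 3 and above, keeping only levels 1 and 2."""
    return [entry for entry in entries if entry["pe"] <= max_pe]


supported = keep_supported(proteins)
supported_accessions = [entry["accession"] for entry in supported]
```

With the sample records above, only the entries at PE levels 1 and 2 remain as evidence for building gene models.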

The Ensembl annotations were then merged with Vega annotations at the transcript level. Transcripts were merged if they shared the same internal exon-intron boundaries (i.e. had identical splicing pattern) with slight differences in the terminal exons allowed. Importantly, all Vega source transcripts (regardless of merge status) were included in the final merged gene set.
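The merge criterion described above can be sketched as a comparison of intron chains. This is only an illustration under stated assumptions, not the actual merge code: exons are given as hypothetical `(start, end)` tuples on the same strand and sequence, single-exon transcripts are treated as not mergeable because they have no internal boundaries to compare, and no limit is placed on how far the terminal exon ends may differ (the text says only "slight differences").

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates, same strand assumed


def intron_chain(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return the introns implied by a transcript's exons, in genomic order."""
    ordered = sorted(exons)
    return [
        (ordered[i][1] + 1, ordered[i + 1][0] - 1)
        for i in range(len(ordered) - 1)
    ]


def can_merge(a: List[Exon], b: List[Exon]) -> bool:
    """True if both transcripts share identical internal exon-intron boundaries.

    The outer start of the first exon and the outer end of the last exon
    do not appear in the intron chain, so terminal exons may differ.
    """
    chain_a = intron_chain(a)
    chain_b = intron_chain(b)
    return len(chain_a) > 0 and chain_a == chain_b


# Hypothetical transcripts: same splicing pattern, different terminal ends.
ensembl_tx = [(100, 200), (300, 400), (500, 650)]
vega_tx = [(120, 200), (300, 400), (500, 600)]
# Hypothetical transcript with a shifted internal acceptor site.
other_tx = [(100, 200), (310, 400), (500, 650)]
```

Here `can_merge(ensembl_tx, vega_tx)` is true because only the terminal exon ends differ, while `can_merge(ensembl_tx, other_tx)` is false because one internal boundary has moved.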

Additional manual annotation of this genome can be found in Vega.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: Zv9 (The Danio rerio Sequencing Project assembly Zv9), INSDC Assembly GCA_000002035.2, Apr 2010
Database version: 76.9
Base Pairs: 1,505,581,940
Golden Path Length: 1,412,464,843
Genebuild by: Ensembl
Genebuild method: Mixed strategy build
Genebuild started: Jun 2010
Genebuild released: Nov 2010
Genebuild last updated/patched: Feb 2014

Gene counts

Coding genes: 26,459 (incl. 32 readthrough)

Genes and/or transcripts that contain an open reading frame (ORF).

Small non coding genes: 4,431

Long non coding genes: 2,583 (incl. 5 readthrough)

Long non coding genes are usually greater than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as long non coding genes: 3prime_overlapping_ncrna, ambiguous_orf, antisense, antisense_RNA, lincRNA, ncrna_host, non_coding, non_stop_decay, processed_transcript, retained_intron, sense_intronic, sense_overlapping. The majority of the long non coding genes in Ensembl are annotated manually by HAVANA.

Pseudogenes: 264

A pseudogene shares an evolutionary history with a functional protein-coding gene but it has been mutated through evolution to contain frameshift and/or stop codon(s) that disrupt the open reading frame.

Gene transcripts: 56,754

Nucleotide sequence resulting from the transcription of the genomic DNA to mRNA. One gene can have different transcripts or splice variants resulting from the alternative splicing of different exons in genes.

Readthrough transcripts are tagged by HAVANA and defined as transcripts connecting two independent loci. A readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs). Readthrough transcripts are also annotated by RefSeq.
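The biotype list given for long non coding genes can be expressed as a simple membership check. This is a minimal sketch: the biotype strings are copied from the list above, while the constant and function names are hypothetical and do not correspond to any Ensembl API.

```python
# Biotypes that Ensembl classes as long non coding, as listed in the text.
LONG_NONCODING_BIOTYPES = frozenset({
    "3prime_overlapping_ncrna",
    "ambiguous_orf",
    "antisense",
    "antisense_RNA",
    "lincRNA",
    "ncrna_host",
    "non_coding",
    "non_stop_decay",
    "processed_transcript",
    "retained_intron",
    "sense_intronic",
    "sense_overlapping",
})


def is_long_noncoding(biotype: str) -> bool:
    """Return True if the biotype is in the long non coding group."""
    return biotype in LONG_NONCODING_BIOTYPES
```

The check is case-sensitive, so the biotype string must be spelled exactly as it appears in the list.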

Other

Genscan gene predictions: 36,628
Short Variants: 1,454,332