Zebrafish assembly and gene annotation

Assembly

After the previous zebrafish assembly (Zv9) was released in July 2010, the zebrafish genome sequence was handed over to the Genome Reference Consortium (GRC) for future improvement and maintenance. The GRC have since produced GRCz10, the tenth zebrafish reference assembly.

The previous assembly (Zv9), although of high quality, featured many gaps and suffered from sub-optimal long-range continuity. In order to overcome this, the GRC have sequenced more than 1500 additional BAC and fosmid clones and added them to the assembly.

The most notable change in the chromosome landscape since the previous assembly is on chromosome 4, which has gained about 15 Mb in length. In addition, 94 of 112 previously unplaced clone-contigs have now been placed on a chromosome. The current assembly has a total sequence length of 1.37 Gb, with 2.09 Mb of gaps. There are 26 chromosomes (including the mitochondrion) and 3,399 scaffolds, composed of 22,852 contigs, with a scaffold N50 of 2.18 Mb and a contig N50 of 1.232 Mb. The N50 size is the length such that 50% of the assembled genome lies in blocks of the N50 size or longer.
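The N50 definition above can be illustrated with a short calculation. This is a minimal sketch, not the procedure the GRC used: the function name `n50` and the toy lengths are made up for illustration and are not GRCz10 data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Blocks are sorted from longest to shortest and accumulated until
    at least half of the total length is covered; the length of the
    block that crosses that threshold is the N50.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths (total = 300, half = 150).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 reaches 150, so the N50 is 70
```

Applying the same function to scaffold lengths instead of contig lengths would give the scaffold N50.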

More information about zebrafish research can be found at the Wellcome Trust Sanger Institute.

The genome assembly represented here corresponds to GenBank Assembly ID GCA_000002035.3

Gene annotation

The Ensembl GRCz10 assembly was annotated using Ensembl's automatic annotation pipeline. Predictions from zebrafish proteins have been given priority over predictions from other non-mammalian vertebrate species. All UniProt proteins were filtered to remove predictions (PE level 3 and above). Aligned zebrafish cDNAs and zebrafish RNASeq data have been used to add UTRs. RNASeq data from embryonic and olfactory epithelium tissues were also used to produce gene models. Genes are named based on the alignment of their coding regions to known entries in public databases; ZFIN genes have priority in this process.

The Ensembl annotations were then merged with Vega annotations at the transcript level. Transcripts were merged if they shared the same internal exon-intron boundaries (i.e. had identical splicing pattern) with slight differences in the terminal exons allowed. Importantly, all Vega source transcripts (regardless of merge status) were included in the final merged gene set.
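The merge criterion described above (identical internal exon-intron boundaries, with terminal exon ends free to differ) can be sketched as follows. This is an illustration only, not the Ensembl/Vega merge code: the function names and the exon representation are hypothetical, both transcripts are assumed to lie on the same chromosome and strand, and the sketch places no limit on how far the terminal ends may differ, whereas the text allows only "slight" differences. Single-exon transcripts have no internal boundaries and are treated here as non-matching, which is an assumption.

```python
def intron_chain(exons):
    """Return the internal exon-intron boundaries of a transcript.

    `exons` is a list of (start, end) tuples. Each boundary pair is
    (end of one exon, start of the next exon); the outer ends of the
    first and last exon are deliberately left out.
    """
    ordered = sorted(exons)
    return [
        (left_end, right_start)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    ]


def same_splicing_pattern(exons_a, exons_b):
    """True if two multi-exon transcripts share all internal boundaries."""
    chain_a = intron_chain(exons_a)
    chain_b = intron_chain(exons_b)
    # An empty chain means a single-exon transcript (assumption: no match).
    return bool(chain_a) and chain_a == chain_b


# Hypothetical coordinates: same introns, different terminal exon ends.
ensembl_tx = [(100, 200), (300, 400), (500, 650)]
vega_tx = [(120, 200), (300, 400), (500, 600)]
shifted_tx = [(100, 200), (310, 400), (500, 600)]

print(same_splicing_pattern(ensembl_tx, vega_tx))     # True
print(same_splicing_pattern(ensembl_tx, shifted_tx))  # False
```

Comparing intron chains rather than whole exon lists is what lets the terminal exons vary while still requiring an identical splicing pattern.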

Additional manual annotation of this genome can be found in Vega.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: GRCz10 (Genome Reference Consortium Zebrafish Build 10), INSDC Assembly GCA_000002035.3, Sep 2014
Database version: 80.10
Base Pairs: 1,464,443,456
Golden Path Length: 1,371,719,383

The golden path is the length of the reference assembly. It consists of the sum of all top-level sequences in the seq_region table, omitting any redundant regions such as haplotypes and PARs (pseudoautosomal regions).

Genebuild by: Ensembl
Genebuild method: Mixed strategy build
Genebuild started: Sep 2014
Genebuild released: May 2015
Genebuild last updated/patched: May 2015
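The golden path definition in this summary amounts to a filtered sum, sketched below. This is a simplified illustration, not Ensembl's implementation: the `SeqRegion` class, its field names, and the toy values are hypothetical, and whole regions are flagged as redundant here, whereas a real PAR covers only part of a sequence.

```python
from dataclasses import dataclass


@dataclass
class SeqRegion:
    """Hypothetical stand-in for one row of sequence-region data."""
    name: str
    length: int
    top_level: bool
    redundant: bool = False  # e.g. a haplotype copy (simplified)


def golden_path_length(regions):
    """Sum the lengths of top-level, non-redundant regions."""
    return sum(r.length for r in regions if r.top_level and not r.redundant)


# Toy values, not GRCz10 data.
toy_regions = [
    SeqRegion("chrA", 1000, top_level=True),
    SeqRegion("chrB", 800, top_level=True),
    SeqRegion("chrA_hap", 200, top_level=True, redundant=True),
    SeqRegion("contig_1", 300, top_level=False),
]
print(golden_path_length(toy_regions))  # 1000 + 800 = 1800
```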

Gene counts

Coding genes: 25,642 (incl 46 readthrough)

Coding genes are genes and/or transcripts that contain an open reading frame (ORF).

Readthrough transcripts are tagged by HAVANA and defined as transcripts connecting two independent loci. A readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs).

Readthrough transcripts are also annotated by RefSeq.

Non coding genes: 6,008
Small non coding genes: 3,172

Small non coding genes are usually fewer than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as small non coding genes: miRNA, miscRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, and also the pseudogenic form of these biotypes. The majority of the small non coding genes in Ensembl are annotated automatically by our ncRNA pipeline. Please note that tRNAs are annotated separately using tRNAscan. tRNAs are included as 'simple features', not genes, because they are not annotated using aligned sequence evidence.
Long non coding genes: 2,741 (incl 7 readthrough)

Long non coding genes are usually greater than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as long non coding genes: 3prime_overlapping_ncrna, ambiguous_orf, antisense, antisense_RNA, lincRNA, ncrna_host, non_coding, non_stop_decay, processed_transcript, retained_intron, sense_intronic, sense_overlapping. The majority of the long non coding genes in Ensembl are annotated manually by HAVANA.

Misc non coding genes: 95
Pseudogenes: 293

A pseudogene shares an evolutionary history with a functional protein-coding gene but it has been mutated through evolution to contain frameshift and/or stop codon(s) that disrupt the open reading frame.

Gene transcripts: 57,369

A gene transcript is the nucleotide sequence resulting from the transcription of the genomic DNA to mRNA. One gene can have different transcripts or splice variants resulting from the alternative splicing of different exons in genes.

Other

Genscan gene predictions: 36,087
Short Variants: 17,502,082
Structural variants: 5,841