Sheep assembly and gene annotation

Assembly

The sheep (Ovis aries) genome was produced by the International Sheep Genome Consortium (ISGC). A single Texel ewe and a single Texel ram were sequenced using Illumina technology. The assembly is based on the Texel ewe data set. The Texel ram data set and Roche 454 reads from the previous assembly v1.0 (ACIV000000000) were used to fill in the gaps. 39,042 SNP markers and Ovine SNP50 genotyping linkage data were used to check scaffold integrity and to anchor scaffolds and super-scaffolds to chromosomes.

The assembly comprises 5,697 toplevel sequences built from 130,765 contigs, including 27 chromosomes (26 autosomes and the X chromosome). The contig N50 is 40.4 kb and the scaffold N50 is 100.1 Mb. The N50 is the length such that 50% of the assembled genome lies in blocks of that length or longer.
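The N50 definition above can be illustrated with a minimal Python sketch. The function name `n50` and the example lengths are illustrative assumptions, not part of the ISGC or Ensembl pipelines; the input is simply a list of contig or scaffold lengths.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Illustrative helper (not from the source pipeline): the N50 is the
    length L such that sequences of length >= L together cover at least
    half of the total assembled length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # The first length at which the running sum reaches half the
        # total is the N50.
        if 2 * running >= total:
            return length


# Toy example with made-up lengths (not sheep data).
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
example_n50 = n50(example_lengths)
```

Applied to the real contig or scaffold lengths of an assembly, the same calculation would yield figures such as the 40.4 kb and 100.1 Mb values quoted above.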

The genome assembly represented here corresponds to GenBank Assembly ID GCA_000298735.1.

Gene annotation

The gene set was built using a mixed approach. The similarity pipeline was used to generate 66,797 models from orthologous vertebrate proteins from UniProtKB. The RNASeq pipeline used 8.2 billion paired-end reads provided by the ISGC. The RNASeq data contains samples from a trio (ram, ewe and lamb), 7 tissue types from the reference sheep and samples from different breeds.

We pooled the tissues to avoid creating too many fragmented models. Using the RNASeq pipeline, we created 19,604 models from the pooled set. When a pooled model was missing but the tissue-specific models agreed on a consensus, the consensus model was added to the pooled set, which brought the number of RNASeq models up to 25,832. By combining the orthologous set, the RNASeq set and the output of our ncRNA pipeline, we built the final gene set: 20,921 protein-coding gene models, 291 pseudogenes and 3,985 short non-coding RNAs.
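The fill-in step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the source does not say how consensus among the tissue models was determined, so the locus identifiers, the function name `add_consensus_models` and the `min_tissues` threshold are all hypothetical.

```python
from collections import Counter


def add_consensus_models(pooled_loci, tissue_loci, min_tissues=2):
    """Return the pooled loci plus loci rescued from tissue-level support.

    Hypothetical sketch: a locus absent from the pooled set is added when
    at least `min_tissues` tissue-specific sets contain a model for it.
    The real pipeline's consensus criterion is not given in the source.
    """
    support = Counter()
    for loci in tissue_loci.values():
        # Count each tissue at most once per locus.
        support.update(set(loci))
    rescued = {
        locus
        for locus, n_tissues in support.items()
        if n_tissues >= min_tissues and locus not in pooled_loci
    }
    return set(pooled_loci) | rescued


# Toy example with made-up locus names.
pooled = {"locusA", "locusB"}
per_tissue = {
    "liver": ["locusA", "locusC"],
    "lung": ["locusC", "locusD"],
    "heart": ["locusC", "locusC"],
}
combined = add_consensus_models(pooled, per_tissue)
```

In this toy case `locusC` is rescued because three tissues support it, while `locusD`, seen in a single tissue, is not.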

RNASeq data set

In addition to the main set, we have predicted gene models for each tissue type using the RNASeq pipeline. We ran BLASTp with these models against UniProt proteins of protein existence levels 1 and 2 in order to confirm the open reading frame (ORF). The best BLAST hit is displayed as transcript supporting evidence.

The tissue-specific sets of transcript models built using our RNASeq pipeline are as follows:

Tissue	Number of gene models
Ewe kidney medulla	8350
Ewe abomasum	8694
Ewe adrenal gland	8447
Ewe alveolar macrophages	7747
Ewe cerebellum	7714
Ewe cervix	6305
Ewe colon	9576
Ewe corpus luteum	8096
Ewe heart ventricle	7256
Ewe liver	8301
Ewe lung	9402
Ewe lymph node mesenteric	8340
Ewe mammary gland	9466
Ewe muscle biceps	7650
Ewe muscle long dorsal	7185
Ewe omentum	8081
Ewe ovary	8077
Ewe peyers patch	9227
Ewe pituitary	7957
Ewe placenta membranes	7703
Ewe rectum	8980
Ewe rumen	8856
Ewe skin side	8795
Ewe thyroid gland	8167
Ewe uterus	9071
Lamb abomasum	8606
Lamb adrenal gland	7813
Lamb caecum	7770
Lamb cerebellum	8405
Lamb cerebrum	8977
Lamb cervix	8924
Lamb colon	8310
Lamb hypothalamus	9009
Lamb kidney cortex	8416
Lamb kidney medulla	9115
Lamb lung	9029
Lamb lymph node mesenteric	8315
Lamb lymph node prescapular	8913
Lamb mammary gland	9924
Lamb muscle biceps	7633
Lamb muscle long dorsal	7155
Lamb omentum	8085
Lamb ovarian follicles	8026
Lamb ovary	8729
Lamb peyers patch	9758
Lamb pituitary gland	8702
Lamb rectum	8546
Lamb rumen	8655
Lamb skin back	8629
Lamb spleen	8444
Lamb thyroid gland	8307
Lamb uterus	9259
Lamb ventricle	8175
Ram kidney cortex	8398
Ram kidney medulla	8754
Ram abomasum mucosa	9003
Ram adrenal gland	8116
Ram alveolar macrophages	7676
Ram brain stem	9277
Ram caecum	9616
Ram cerebellum	8662
Ram cerebrum	9034
Ram colon	8839
Ram duodenum	9068
Ram hypothalamus	9153
Ram liver	7674
Ram lung	9041
Ram lymph node mesenteric	7857
Ram lymph node prescapular	7018
Ram muscle biceps	6392
Ram muscle long dorsal	6073
Ram omentum	8640
Ram pituitary gland	8627
Ram rectum	8925
Ram rumen	8519
Ram skin back	8631
Ram spleen	8619
Ram testes epididymis	8805
Ram testes	10891
Ram thyroid gland	7817
Ram tonsil	8477
Ram ventricle	8058
Whole embryo	10965
Reference kidney	6422
Reference brain	5737
Reference heart	5379
Reference liver	5396
Reference lung	7487
Reference ovarian	7112
Reference white adipose	6755
Merino skin	5418
Merged	19604

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: Oar_v3.1, INSDC Assembly GCA_000298735.1, Aug 2012
Database version: 76.31
Base Pairs: 2,534,344,180
Golden Path Length: 2,619,054,388
Genebuild by: Ensembl
Genebuild method: Mixed strategy build
Genebuild started: Dec 2012
Genebuild released: Dec 2013
Genebuild last updated/patched: Dec 2013

Gene counts

Coding genes

Genes and/or transcripts that contain an open reading frame (ORF).

20,921

Small non coding genes

Small non coding genes are usually fewer than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as small non coding genes: miRNA, miscRNA, rRNA, tRNA, scRNA, snlRNA, snoRNA and snRNA, and also the pseudogenic forms of these biotypes. The majority of the small non coding genes in Ensembl are annotated automatically by our ncRNA pipeline.

3,985

Pseudogenes

A pseudogene shares an evolutionary history with a functional protein-coding gene but has been mutated through evolution to contain frameshift(s) and/or stop codon(s) that disrupt the open reading frame.

291

Gene transcripts

Nucleotide sequence resulting from the transcription of the genomic DNA to mRNA. One gene can have several transcripts, or splice variants, resulting from alternative splicing of its exons.

27,099

Other

Genscan gene predictions: 43,449
Short Variants: 32,960,089