Annotation of Non-Coding RNAs
Non-coding RNA Overview
Non-coding RNAs (ncRNAs) are involved in many biological processes, and their importance is increasingly recognised. As with proteins, it is the overall structure of the molecule that imparts function. However, while similar protein structures are often reflected in a conserved amino acid sequence, the sequences underlying RNA secondary structure are highly variable; this makes ncRNAs difficult to detect using sequence alone.
Because of this, we use a variety of techniques to detect ncRNAs. First, a combination of sensitive BLAST searches is used to identify likely targets; a covariance model search is then used to measure the probability that the targets can fold into the required structures. Other ncRNAs are added as part of the raw compute stage.
The following non-coding RNA gene types are annotated, along with pseudogenes:
- transfer RNA
- transfer RNA located in the mitochondrial genome
- ribosomal RNA
- small cytoplasmic RNA
- small nuclear RNA
- small nucleolar RNA
- microRNA precursors
- miscellaneous other RNA
- long intergenic non-coding RNA
Most ncRNAs are annotated by aligning genomic sequence against Rfam using BLASTN. The BLAST hits are clustered and filtered by E-value, and are then used to seed Infernal searches of each locus with the corresponding Rfam covariance models. This reduces the search space, because scanning the entire genome with all the Rfam covariance models would be extremely CPU-intensive. The resulting BLAST hits are then used as supporting evidence for ncRNA genes.
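The filtering and clustering step can be sketched as follows. This is a minimal illustration, not the Ensembl pipeline code: the `BlastHit` record, the `seed_loci` function, the E-value cutoff of 1e-5 and the rule of merging overlapping hits per family are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BlastHit:
    family: str    # Rfam family accession the hit aligned to
    start: int     # genomic start, 1-based inclusive
    end: int       # genomic end, inclusive
    evalue: float  # BLAST E-value of the hit


def seed_loci(hits: List[BlastHit], max_evalue: float = 1e-5) -> List[Dict]:
    """Drop weak hits, then merge overlapping hits of one family into loci.

    The cutoff and the merge rule are hypothetical; each returned locus is
    the region a covariance model search for that family would be run on.
    """
    kept = sorted(
        (h for h in hits if h.evalue <= max_evalue),
        key=lambda h: (h.family, h.start),
    )
    loci: List[Dict] = []
    for hit in kept:
        last = loci[-1] if loci else None
        if last and last["family"] == hit.family and hit.start <= last["end"]:
            # Overlaps the previous locus of the same family: extend it.
            last["end"] = max(last["end"], hit.end)
            last["best_evalue"] = min(last["best_evalue"], hit.evalue)
        else:
            loci.append(
                {
                    "family": hit.family,
                    "start": hit.start,
                    "end": hit.end,
                    "best_evalue": hit.evalue,
                }
            )
    return loci
```

Only the merged loci, rather than the whole genome, would then be passed to the covariance model search for the matching family.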
miRNAs are predicted by running BLASTN with genomic sequence slices against miRBase sequences from all species. The BLAST hits are clustered and filtered by E-value, and the aligned genomic sequence is then checked for possible secondary structure using RNAfold. If there is evidence that the genomic sequence could form a stable hairpin structure, the locus is used to create a miRNA gene model. The resulting BLAST hit is used as supporting evidence for the miRNA gene.
Note: The miRNA identifier and name are only associated with the resulting Ensembl miRNA if they are of the same species.
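As an illustration of the hairpin check, the sketch below inspects a dot-bracket structure and minimum free energy of the kind RNAfold reports. The function name `is_stable_hairpin` and both thresholds (`min_pairs`, `max_mfe`) are hypothetical and chosen only for the example; the actual criteria used in the pipeline are not stated here.

```python
def is_stable_hairpin(
    structure: str,
    mfe: float,
    min_pairs: int = 16,     # hypothetical minimum number of base pairs
    max_mfe: float = -20.0,  # hypothetical free energy ceiling (kcal/mol)
) -> bool:
    """Return True if a dot-bracket structure is one sufficiently stable stem-loop."""
    if structure.count("(") != structure.count(")"):
        raise ValueError("unbalanced dot-bracket structure")
    last_open = structure.rfind("(")
    first_close = structure.find(")")
    if last_open == -1:
        return False  # no base pairs at all
    # A single stem-loop opens all of its pairs before closing any of them.
    single_stem = last_open < first_close
    return single_stem and structure.count("(") >= min_pairs and mfe <= max_mfe
```

A structure such as `((..))((..))` fails the single-stem test because a new stem opens after the first one has closed.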
tRNAs are annotated as part of the raw compute process using tRNAscan-SE.
lincRNA (long intergenic non-coding RNAs)
Ensembl gene annotation, cDNA alignments and chromatin-state map data from the Ensembl regulatory build are used to predict lincRNAs for human and mouse. We do not import the lincRNAs identified by Guttman et al., but their publication guided our current approach to annotating lincRNAs automatically. First, regions of chromatin marked by histone methylation (H3K4me3 and H3K36me3) outside known protein-coding loci are identified. Next, cDNAs that overlap H3K4me3 or H3K36me3 features are identified as candidate lincRNAs. A final evaluation step assesses whether each candidate lincRNA has any protein-coding potential. Any candidate lincRNA that contains a substantial open reading frame (ORF) covering 35% or more of its length and that contains Pfam/TIGRFAM protein domains is rejected. Candidate lincRNAs that pass the final evaluation step are included in the human or mouse gene set as lincRNA genes.
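The evaluation logic can be sketched as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, ORFs are searched only as ATG-to-stop runs in the three forward frames, and the chromatin overlap and protein domain results are passed in as precomputed booleans rather than derived from real data.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides of the longest ATG..stop ORF in the forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)  # include the stop codon
                start = None
    return best


def is_lincrna(
    cdna: str,
    overlaps_mark: bool,        # overlaps an H3K4me3 or H3K36me3 feature
    overlaps_coding: bool,      # overlaps a known protein-coding locus
    has_protein_domain: bool,   # ORF has a Pfam/TIGRFAM domain hit
    min_orf_fraction: float = 0.35,
) -> bool:
    """Apply the candidate selection and protein-coding potential filter."""
    if overlaps_coding or not overlaps_mark:
        return False  # not a candidate in the first place
    orf_fraction = longest_orf_length(cdna) / len(cdna) if cdna else 0.0
    # Rejected only when both conditions hold: a substantial ORF and a domain hit.
    coding_like = orf_fraction >= min_orf_fraction and has_protein_domain
    return not coding_like
```

Note that a long ORF alone does not reject a candidate in this sketch; following the description above, rejection requires both the ORF coverage and a protein domain hit.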
Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, Huarte M, Zuk O, Carey BW, Cassady JP, Cabili MN, Jaenisch R, Mikkelsen TS, Jacks T, Hacohen N, Bernstein BE, Kellis M, Regev A, Rinn JL, Lander ES
Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals.
Nature 2009 458:223-227