The Ensembl Regulatory Build on human
The Ensembl Regulatory Build (Zerbino et al. 2015) provides a genome-wide set of regions that are likely to be involved in gene regulation. The classification of these regions are:
- Predicted promoters
- Predicted promoter flanking regions
- Predicted enhancers
- CTCF binding sites
- Unannotated transcription factor binding sites
- Unannotated open chromatin regions
Segmentation and annotation of segmentation states
We start by running a segmentation across all possible cell-types. Because of heterogeneous datasets, we allow for multiple segmentations to be run separately on different sets of cells. Each segmentation annotates the genomes of its designated cell-types with a fixed number of states, which are generally identified with a number.
For each state of each segmentation, we create a summary track which represents the number of cell-types that have that state at any given base pair of the genome. The overlaps of this summary function with known features (transcription start sites, exons) and experimental features (CTCF binding sites, known chromatin repression marks) are used to assign a preliminary label to that state. For practical purposes, this annotation is manually curated. The labels used are either one of the above functional labels, or non-functional labels (dead,weak or repressed).
Defining the MultiCell regulatory features
We first determine the a cell type independent functional annotation of the genome, referred to as the MultiCell Regulatory Build. This build defines the function of genomic regions.
To determine whether a state is useful in practice, it is compared to the overall density of transcription factor binding (as measured with ChIP-Seq). Applying increasing integer cutoffs to this signal, we define progressively smaller regions. If these regions reach a 2 fold enrichment in transcription factor binding signali, then the state is retained for the build. This means that although all states are annotated, not all are used to build the Regulatory Build.
For any given segmentation, we define initial regions. For every functional label, all the state summaries that were assigned that labelled and judged informative are summed into a single function. Using the overall TF binding signal as true signal, we select the threshold which produces the highest F-score.
We then merge the regulatory features across segmentations by annotation.
Some simplifications are applied a posteriori:
- Distal enhancers which overlap promoter flanking regions are merged into the latter.
- Promoter flanking regions which overlap transcription start sites are incorporated into the flanking regions of the latter features.
In addition to the segmentation states, which are essentially derived from histone marks, we integrate independent experimental evidence:
- Transcription factor binding sites which were observed through ChIP-Seq but are covered by none of the newly defined features are added to the Build.
- Open chromatin regions which were experimentally observed but covered by none of the above annotations, are also added to the Build.
Cell type specific annotations
The cell type specific regulatory features are identical to the MultiCell ones in position and classification but have an added activity annotation. Currently this activity state is purely binary (on/off) although this could be extended to finer annotations (poised, repressed, information not available, ...).
For each cell type and each functional annotation, we check whether there is segmentation state or experimental evidence which could be used to test the activity of this annotation. If this evidence exists, then all MultiCell features with that label are annotated as on or off by simple overlap analysis. If this evidence is not available, these features are not represented in the cell type specific annotation (Note: this could change in the near future).
The Regulatory Build for non-human species
Because not all species are studied as intensively, segmentation software cannot always be run on the available data. The preliminary Ensembl Regulatory Build provides a genome-wide set of regions that are likely to be involved in gene regulation. The classification of these regions are:
- Promoter associated
- Gene associated
- PolIII associated
- Non-gene associated
Each region is constructed from experimental data that describes chromatin activity or state (e.g. DNase1, Histone modifications, Polymerase) and transcription factor binding following a four-step analysis :
- Genomic regions characterised by DNA protein binding or open chromatin across all cell types are defined as Regulatory Features;
- Combinations of epigenetic marks and transcription factor binding sites (TFBS) are associated with one of the classification above (e.g. Promoter associated, unclassified, etc) for specific cell lines;
- For each cell type, any Regulatory Feature that overlaps with a combination of epigenetic markers and/or TFBS is assigned the appropriate classification;
- The Regulatory Features associated to TF binding are annotated with the corresponding TF binding motifs. For example, a Regulatory Feature that overlaps with a CTCF binding feature is annotated with all the potential CTCF binding sites that it contains.
Regulatory Feature Construction
First we define focus experimental marks, which narrow the analysis on focused regions on the genome. These marks were chosen because they are widespread across the transcriptional machinery. They include DNase1 hypersensitivity (which indicates region of open chromatin), CTCF (which characterises insulator elements) and other transcription factor binding marks.
Next, we define the core regions where focus marks overlap. These are likely to be positioned on or around any potential regulatory motif. To maintain resolution and to avoid chaining of regulatory features across regions of dense regulatory elements, a 2kb length threshold is used. If a feature exceeds this cut-off, it is treated as an attribute feature (see below) rather than a focus feature, and does not contribute to the core region structure.
Once the core regions define generic MultiCell regulatory features, they are then extended into cell type specific attribute Regulatory Features. Attribute features integrate long-range marks (typically histone modifications) and are useful for classification. For a given cell type, if a core region overlaps with a focus marker, it is extended into an attribute feature. Overlapping and nearby (<2kb) marks are integrated into the feature, thus defining the bounds of the attribute region around the core region.
For a cell line where there is no core data available, but there is substantial attribute data present, the core regions defined by the other cell lines are projected onto it. These regions are extended into attribute features as detailed above.
Regulatory Feature Annotation
We first define a set of Genomic Features, which broadly characterise the function of a genomic region into:
- Genes: gene bodies
- Promoters: within 2.5kb of a transcription start site, but not in the downstream gene body,
- Non-gene: any region outside of the gene bodies,
- Polymerase III sites: regions 2.5kb upstream of PolIII transcribed regions, e.g. tRNAs.
Cell-type specific attribute features are classified according to the pattern or combination of marks (e.g TFBS and/or histone modifications) integrated into them (see above). For each pattern that occurs in at least 100 Regulatory Features, we count the number of times these features overlap with each class of Genomic Feature, and compute the same for a random set of regions with the same lengths. If more than 50% of the Regulatory Features overlap a particular class of Genomic Feature and the chi-squared statistic with respect to the random regions is greater than 8.0 (P<0.005) we record that this pattern is associated with this class of Genomic Feature. If associated, we test whether longer patterns (i.e. containing extra marks) also overlap with the same genomic feature. If fewer than 50% of those Regulatory Features overlap, we record that this second pattern is not associated with that class of Genomic Feature.
Having determined which pattern is associated or not to a given class of Genomic Feature, we look at each Regulatory Feature and use those patterns to associate it to a Genomic Feature. During this process it is possible for a given Regulatory Feature to be associated with more than one class of Genomic Feature. This is particularly the case where experimental marks are densely clustered. To resolve these conflicts, a priority rule is observed.
We only classify cell-type specific regulatory features. We do not classify the multi-cell type set (i.e. all cell types combined) as different cell types may give conflicting signals reflecting their unique combination of regulatory and transcriptional states.
Transcription Factor Binding Site Annotation
For each transcription factor (TF), which has both a ChIP-seq data set in the funcgen database, and a publicly available position weight matrix (PWM), we annotated the position of putative TF binding sites within the ChIP-seq peaks.
PWMs are taken from JASPAR (Bryen et al., 2008) and mapped to the genome using the find_pssm_dna program from the MOODS software (Korhonen et al., 2009) with the -f flag set and a permissive threshold of 0.001. We then filter these mappings using a log odds score threshold. The threshold is derived per PWM by considering the occurrence of mappings in a sample of randomly positioned 'background' sequences, matched in terms of size and chromosome to the ChIP-seq peaks for a given TF. We select the threshold such that the proportion of these background peaks containing a mapping is approximately 5%. Only mappings that overlap the corresponding ChIP-seq peaks are included in the functional genomics database.
PWM features (also known as MotifFeatures) are visualised in the browser as black boxes within regulatory features and TF evidence peaks. Sometimes these boxes are very thin and appear as lines. Clicking on the black box will highlight specific information in the pop-up menu, such as the matching score (relatively to the optimal site) among others, as shown below.