Datasets and Data Processing
Regulatory features are generated from a variety of genome-wide data sets, mostly derived from ChIP-seq data on chromatin structure, epigenomic marks, and transcription factor binding. Metadata and links to the original data sources can be viewed via the various experiment views.
At the bottom of these views, the results table also shows the mappings between transcription factors, Ensembl genes, and JASPAR matrices that have been used in the regulatory build.
To maintain a standardised peak calling methodology, we start our analyses with the raw reads from each experiment. Reads from replicates are pooled and aligned to the genome using BWA with default parameters. All alignments to mitochondrial DNA are then filtered out to avoid alignment anomalies caused by similarities with autosomal regions.
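The mitochondrial filtering step can be sketched as follows. This is a minimal illustration and not the pipeline's actual code: it assumes the alignments are available as SAM text lines, and the set of mitochondrial sequence names (`MITO_NAMES`) and the function name `drop_mitochondrial` are hypothetical and would need to match the reference in use.

```python
# Assumed names for the mitochondrial sequence; adjust to the reference genome.
MITO_NAMES = {"MT", "chrM", "chrMT", "M"}


def drop_mitochondrial(sam_lines, mito_names=MITO_NAMES):
    """Yield SAM lines, skipping alignments whose reference is mitochondrial.

    Header lines (starting with '@') are passed through unchanged.
    """
    for line in sam_lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 of a SAM record (index 2) is RNAME, the reference name.
        if len(fields) > 2 and fields[2] in mito_names:
            continue
        yield line


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.0",
        "read1\t0\t1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "read2\t0\tMT\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    for kept in drop_mitochondrial(example):
        print(kept)
```

Filtering on the reference name alone keeps the sketch independent of any particular alignment library; in practice the same filter could be applied with a dedicated SAM/BAM tool.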
Peak calling is performed using two algorithms:
- SWEMBL (S. Wilder et al., in preparation)
  - Strict parameters (-f 150 -R 0.015) are used for most experiments. These were obtained using CTCF as a reference dataset.
  - Less stringent settings (-f 150 -R 0.0025) are used for data with a broader distribution of reads. These are generally used with DNase1 datasets to capture a larger set of potential regulatory regions.
- CCAT (Xu et al., 2010)
  - The algorithm described by Xu et al. is specifically designed for peak calling of broad features, such as H3K36me3. The parameters for histone calling were set using a 1 kb window.
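The choice of peak caller and parameter set described above can be summarised in a small lookup. This is only a sketch: the function `choose_peak_caller` and the two sets of feature names are hypothetical and contain just the examples named in the text, and only the SWEMBL `-f` and `-R` values are taken from the text. No other command-line options for either tool are shown, since they are not given here.

```python
# SWEMBL parameter sets quoted in the text.
SWEMBL_STRICT = ["-f", "150", "-R", "0.015"]
SWEMBL_PERMISSIVE = ["-f", "150", "-R", "0.0025"]

# Hypothetical groupings, containing only the examples named in the text.
BROAD_FEATURES = {"H3K36me3"}
BROAD_READ_DISTRIBUTION = {"DNase1"}


def choose_peak_caller(feature_type):
    """Return (caller_name, parameter_list) for a feature type.

    CCAT settings are not spelled out in the text beyond the 1 kb window,
    so an empty parameter list is returned for it.
    """
    if feature_type in BROAD_FEATURES:
        return ("CCAT", [])
    if feature_type in BROAD_READ_DISTRIBUTION:
        return ("SWEMBL", list(SWEMBL_PERMISSIVE))
    return ("SWEMBL", list(SWEMBL_STRICT))


if __name__ == "__main__":
    for name in ("CTCF", "DNase1", "H3K36me3"):
        print(name, choose_peak_caller(name))
```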
The ENCODE DAC Blacklist regions are then used to filter the resultant peaks. These regions are stored in the core database as 'encode_excluded' MiscFeatures.
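The blacklist filtering step amounts to discarding peaks that fall in excluded regions. The sketch below assumes that any overlap with an excluded region removes a peak, which is an assumption and not stated in the text. It also assumes peaks and regions are plain `(seq_region, start, end)` tuples with inclusive coordinates; the function name `filter_blacklisted` is hypothetical, and how the 'encode_excluded' MiscFeatures are fetched from the core database is not shown.

```python
from collections import defaultdict


def filter_blacklisted(peaks, excluded):
    """Return the peaks that do not overlap any excluded region.

    Both arguments are iterables of (seq_region, start, end) tuples with
    inclusive coordinates. Any overlap removes the peak (an assumption).
    """
    by_region = defaultdict(list)
    for region, start, end in excluded:
        by_region[region].append((start, end))

    kept = []
    for region, start, end in peaks:
        overlaps = any(
            start <= ex_end and ex_start <= end
            for ex_start, ex_end in by_region.get(region, ())
        )
        if not overlaps:
            kept.append((region, start, end))
    return kept


if __name__ == "__main__":
    peaks = [("1", 100, 200), ("1", 500, 600), ("2", 100, 200)]
    excluded = [("1", 150, 160)]
    print(filter_blacklisted(peaks, excluded))
```

A linear scan per peak is enough for illustration; with genome-wide peak sets, sorting the excluded regions or using an interval tree would scale better.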