Datasets and Data Processing
Regulatory features are generated by using a variety of genome-wide data sets, mostly derived from ChIP-seq data on chromatin structure, epigenomic, and transcription factor binding. Metadata and links to the original data sources can be visualised via the the various experiment views:
At the bottom of these views, the results table also shows the mappings between transcription factors, Ensembl genes and Jaspar matrices that have been used in the regulatory build.
To maintain a standardised peak calling methodology, we start our analyses with raw reads from each experiment. We align reads (replicates are pooled) to the genome using BWA (with default parameters). All matches to the mitochondrial DNA are filtered out to avoid alignment anomalies due to similarities with autosomal regions.
Peak calling is performed using two algorithms:
- SWEMBL (S. Wilder et al., in preparation):
- When replicates are available, permissive settings (-f 150 -R 0.0005 -d 150) are used, of which we choose the best peaks, so as to fit an IDR score of 0.01. This number of peaks is linearly adjusted to take into accounts replicate pairs with different numbers of peaks.
- Otherwise, different stringent settings are used for very tight peaks (e.g. DNAse1) (-f 150 -R 0.0025) and larger ones (-f 150 -R 0.015).
- CCAT (Xu et al., 2010)
- The algorithm described by Xu et al, is specifically designed for peak calling of broad features, such as H3K36me3. The parameters were set empirically to (fragmentSize 200; slidingWinSize 150; movingStep 20; isStrandSensitiveMode 0; minCount 10; outputNum 100000; randomSeed 123456; minScore 4.0; bootstrapPass 50).
The ENCODE DAC Blacklist regions are then used to filter the resultant peaks. These regions are stored in the core database as 'encode_excluded' MiscFeatures.