Microarray Probe Mapping
Ensembl annotates expression microarrays against the genome sequence when manufacturers disclose the probe sequences for a given microarray. The mapping process is a two-step procedure.
Step One: Genome/Transcript Sequence Alignment
In the first step, individual probes (oligonucleotides) are mapped to both the genome sequence and the cDNA sequence. Transcript alignments are performed to capture probes which span introns. All alignments are stored with reference to the genome sequence, i.e. transcript alignments are reconstituted as gapped alignments against the genome. Alignments are stored as ProbeFeatures using the extended CIGAR format as defined by the SAMtools group. Alignments are performed using the Ensembl analysis pipeline, implementing the Exonerate sequence comparison and alignment tool (Slater et al., 2005). By default, a mismatch of 1 bp is permitted between the probe and the genome sequence assembly. Probes that match at 100 or more locations (e.g. suspected Alu repeats) are discarded and not stored in the database.
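The two filtering rules in step one can be illustrated with a minimal Python sketch. This is not the Ensembl pipeline code: the Alignment record, its field names and the filter_alignments function are hypothetical, and it assumes that "locations" means distinct (sequence region, start) positions.

```python
from collections import defaultdict
from typing import NamedTuple

MAX_MISMATCHES = 1       # default mismatch allowance described in the text
MAX_HIT_LOCATIONS = 100  # probes matching at this many locations or more are discarded


class Alignment(NamedTuple):
    """Hypothetical record for one probe alignment (not an Ensembl class)."""
    probe_id: str
    seq_region: str
    start: int
    mismatches: int


def filter_alignments(alignments):
    """Drop alignments over the mismatch limit, then drop promiscuous probes."""
    by_probe = defaultdict(list)
    for aln in alignments:
        if aln.mismatches <= MAX_MISMATCHES:
            by_probe[aln.probe_id].append(aln)

    kept = {}
    for probe_id, hits in by_probe.items():
        # Assumption: a "location" is a distinct (seq_region, start) pair.
        locations = {(h.seq_region, h.start) for h in hits}
        if len(locations) < MAX_HIT_LOCATIONS:
            kept[probe_id] = hits
    return kept


# p1 has one acceptable hit and one with too many mismatches;
# p2 matches at 100 locations and is discarded entirely.
alignments = [Alignment("p1", "1", 1000, 0), Alignment("p1", "2", 500, 2)]
alignments += [Alignment("p2", "1", i * 50, 0) for i in range(100)]
kept = filter_alignments(alignments)
```

In this example only probe p1 survives, with its single acceptable alignment.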
Step Two: Ensembl Transcript Annotation
In the second step, we aim to associate microarray probes or probe sets with Ensembl transcript predictions (ENST...) using the ProbeFeatures generated from step one. For arrays with probe sets (e.g. Affymetrix®), it is normally required that more than 50% of the probes in a probe set hit a given transcript sequence. Probe set sizes are determined dynamically for each probe set, rather than taking the array-wide value documented by the manufacturer. Arrays which do not contain probe sets as part of their design have transcript annotations assigned directly to individual probes.
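The probe set rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline's implementation: the function name, the input dictionaries and the transcript identifiers (ENST_A, ENST_B) are hypothetical, and the probe set size is taken to be the number of probes actually assigned to each set.

```python
from collections import defaultdict


def annotate_probe_sets(probe_to_set, probe_hits, min_fraction=0.5):
    """Map each probe set to the transcripts hit by more than min_fraction of its probes.

    probe_to_set: {probe_id: probe_set_id}
    probe_hits:   {probe_id: iterable of transcript ids hit by that probe}
    """
    # Determine probe set sizes dynamically from the probes actually present.
    members = defaultdict(set)
    for probe, probe_set in probe_to_set.items():
        members[probe_set].add(probe)

    annotations = {}
    for probe_set, probes in members.items():
        hit_counts = defaultdict(int)
        for probe in probes:
            # Count each probe at most once per transcript.
            for transcript in set(probe_hits.get(probe, ())):
                hit_counts[transcript] += 1
        annotations[probe_set] = sorted(
            t for t, n in hit_counts.items() if n / len(probes) > min_fraction
        )
    return annotations


# Four probes in one set: ENST_A is hit by 3 of 4 (75%), ENST_B by 2 of 4 (50%).
probe_to_set = {"a": "PS1", "b": "PS1", "c": "PS1", "d": "PS1"}
probe_hits = {"a": ["ENST_A", "ENST_B"], "b": ["ENST_A", "ENST_B"], "c": ["ENST_A"]}
result = annotate_probe_sets(probe_to_set, probe_hits)
```

Here ENST_B is not assigned because exactly 50% does not satisfy "more than 50%".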
A ProbeFeature is matched to a transcript if it overlaps with an exon or UTR region with a maximum of 1 bp mismatch. To account for conservative UTR estimation, transcript cDNA sequences are extended by the length of the UTR. Where annotated UTRs are absent, a default UTR length is used, calculated separately for five-prime and three-prime UTRs as the greater of the mean and the median of all annotated UTRs for a given species.
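The default UTR length calculation can be written as a short Python sketch. The function names are hypothetical, the example lengths are made up, and the way the extension is applied to genomic coordinates (five-prime end at the start on the forward strand, at the end on the reverse strand) is an assumption made for illustration.

```python
from statistics import mean, median


def default_utr_length(annotated_lengths):
    """Return the greater of the mean and the median of the annotated UTR lengths."""
    if not annotated_lengths:
        raise ValueError("no annotated UTRs to estimate from")
    return max(mean(annotated_lengths), median(annotated_lengths))


def extended_span(start, end, strand, five_prime_ext, three_prime_ext):
    """Extend a transcript's genomic span by the given UTR lengths.

    Assumption: on the forward strand (1) the five-prime end is at `start`;
    on the reverse strand (-1) it is at `end`.
    """
    if strand == 1:
        new_start, new_end = start - five_prime_ext, end + three_prime_ext
    else:
        new_start, new_end = start - three_prime_ext, end + five_prime_ext
    return max(1, new_start), new_end


# Made-up UTR lengths: a skewed set where the mean exceeds the median,
# and a set where the median exceeds the mean.
skewed = default_utr_length([100, 200, 900])     # mean 400, median 200
clustered = default_utr_length([100, 300, 320])  # mean 240, median 300
span = extended_span(1000, 2000, -1, five_prime_ext=300, three_prime_ext=400)
```

Taking the greater of the two statistics gives the more generous extension whichever way the length distribution is skewed.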
In the Ensembl browser, individual probe alignments from step one can be displayed in the 'Region in detail' view. Probes that match to a transcript can be seen in the 'Oligo probes' view, accessible via the transcript page.
The probe mappings and transcript annotations now reside in the functional genomics (funcgen) database. As such, programmatic access requires the use of the ensembl-funcgen API.
POD documentation is available here:
Probe- and ProbeSet-level transcript annotations are now stored in the funcgen databases, along with information on individual ProbeFeatures. Objects which fail the mapping criteria are stored as UnmappedObjects. An example script for access to these data is available here:
The transcript annotations generated by the Ensembl array mapping pipeline are also available in BioMart. These data are currently incorporated into the main Ensembl genes mart; see the 'Microarray' section in the 'Attributes' panel.
Running The Pipeline
Fancy running your own custom array through the Ensembl array mapping pipeline? Further documentation about the efg array mapping environment can be found here: