Microarray Probe Mapping

Expression microarrays are used to measure the expression levels of large numbers of genes simultaneously. Ensembl annotates expression microarrays on the reference genome and transcripts sequences for those arrays whose manufacturers disclose the probe sequences in a public archive. Our annotation pipeline comprises of two phases that are summarised in Figure 1 and 2 respectively, along with the intermediate results produced.

What are probes and probesets?

In expression arrays a probe (or oligoprobe) is a short DNA sequence targeting a short region of a transcript. They are used to detect the presence of nucleotide sequences through hybridisation to single-stranded nucleic acid due to complementarity between the probe and the target. Typically, probes are 25 bp to 60 bp long. In Affymetrix arrays, probes are grouped into probesets which are designed to target the same transcript with multiple measurements.

First phase: genome and transcript sequence alignment

Each probe sequence is aligned to both the reference genome and transcript sequences, with the latter comparison allowing us to capture probes that span introns and would be otherwise missed by the genomic alignment step. Alignment is performed using Exonerate (Slater et al., 2005) and the resulting hits are filtered to keep only those that constitute a perfect match across the whole probe sequence, while the rest are being discarded. Probes that align to 100 or more loci are considered promiscuous and are stored in our regulation databases as unmapped objects. For those transcriptomic hits that fulfill the above criteria we perform an extra step where the the transcriptomic hit coordinates are converted (‘projected’) to the equivalent genomic coordinates. Finally, both genomic and the projected transcriptomic hits are stored in our databases as probe features using the extended cigar format.

Probe mapping pipeline first phase
Figure 1: Probe mapping pipeline first phase - from probe sequences to genomic alignments (probe features)

Second phase: Mapping probes and probesets to transcripts

The goal of this phase is to map microarray probes and probesets (when appropriate) to Ensembl transcripts (Figure 2). Initially, the genomic coordinates of the probe features generated in phase one are compared to the genomic coordinates of the full set of transcripts, including their UTRs (“extended transcripts”). In cases where annotated UTRs are absent a default UTR length is used. This is calculated for both 5’ and 3’ UTRs as the highest of either the mean or the median of all annotated UTRs for a given species.

A probe feature is linked to a transcript if it almost completely overlaps (allowing for only 1 bp of non overlap) with an exon or UTR region. Probe features that fail to fulfill this criterion are stored as unmapped objects. Successful probe feature to transcript mappings are then used to draw the connections between probes and transcripts and stored as probe transcript mappings.

Finally, for arrays that contain probesets, we consider all transcripts to which the probes of a probeset have been mapped to (see previous step). If less than half of its probes have been assigned to a transcript, the mapping is considered insufficient and stored as an unmapped object. If, based on this criterion, the overall number of transcripts to which the probeset would be mapped exceeds 100, then the probeset is considered promiscuous and all mappings that were considered are stored as unmapped objects. Otherwise, the successful probeset to transcript mappings are stored in the database.

Probe mapping pipeline second phase
Figure 2: Probe mapping pipeline second phase - mapping probe features, probes and probesets to transcripts

Data visualisation and access

Probe features can be displayed in the 'Region in detail' view in the Ensembl browser, while probe to transcript mappings can be seen in the 'Oligo probes' view, accessible via the transcript page.

The probeset to transcript mappings generated from the Ensembl probe mapping pipeline are also available through BioMart. These data are currently incorporated into the main Ensembl genes mart under the 'Microarray' section in the 'Attributes' panel.

Programmatic access to our probe mapping pipeline results is available through either our Perl or REST API.

The transcript annotations generated from the Ensembl array mapping pipeline are also available in BioMart. These data are currently incorporated into the main Ensembl genes mart, see the 'Microarray' section in the 'Attributes' panel.

Annotated arrays in Ensembl

A number of commercially available arrays are mapped to Ensembl transcripts and genomes. For a list of those available, use our REST API. If you would like to see a commercially available array that is not listed and the probe sequences are available in a public archive, please contact us.