
Ensembl BlastView Configuration

Contents

  1. Introduction
  2. Registering Sequence Databases
  3. Registering Search Methods
  4. Registering Method vs. Database Links
  5. Configuring the ENSEMBL BLAST Database
  6. Conclusion

Introduction

This document provides details for configuring the Ensembl BlastView web interface.

Ensembl Configuration Overview

The Ensembl distribution is usually stored within a filesystem directory (hereafter referred to as $ENSEMBL_ROOT), e.g.:

$ setenv ENSEMBL_ROOT /usr/local/ensembl

Ensembl web site configuration data are stored in text files within $ENSEMBL_ROOT, in the following directory:

$ENSEMBL_ROOT/conf

Species-independent configuration is stored in the following file (hereafter referred to as MULTI.ini):

$ENSEMBL_ROOT/conf/ini-files/MULTI.ini

Species-default configuration is stored in the following file (hereafter referred to as DEFAULTS.ini):

$ENSEMBL_ROOT/conf/ini-files/DEFAULTS.ini

Species-specific configuration is stored in per-species files (hereafter referred to as <SPECIES>.ini), for example:

$ENSEMBL_ROOT/conf/ini-files/Homo_sapiens.ini
$ENSEMBL_ROOT/conf/ini-files/Danio_rerio.ini

The .ini files are separated into sections containing key-value pairs using the format:

[SECTION_HEADING]
key1 = value1
key2 = value2
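
As a rough illustration of this layout, the sketch below parses a small fragment with Python's standard configparser module. The section and key names are the placeholders from the format above. Python is used here only for illustration: Ensembl itself is written in Perl, and its own parser may treat comments and whitespace differently.

```python
import configparser

# Placeholder content mirroring the format described above.
SAMPLE = """
[SECTION_HEADING]
key1 = value1
key2 = value2
"""


def read_ini(text):
    """Parse .ini text into a {section: {key: value}} dictionary."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case exactly as written
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


config = read_ini(SAMPLE)
print(config["SECTION_HEADING"]["key1"])
```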

BlastView Configuration

BlastView configuration consists of the following components:

  • Register of sequence databases to search against
  • Register of methods (executables) that run the search
  • Register of search method vs. sequence database linkage
  • Location of MySQL database for search report storage (ENSEMBL_BLAST)

Databases and methods are generally species independent, and their configurations are therefore stored in the DEFAULTS.ini file. Method vs. sequence database linkage, however, is species specific, and stored in the <SPECIES>.ini files.

To keep the interface responsive, search reports are parsed once and the results are stored in a database. The location of this database is configured in the MULTI.ini file.

Registering Sequence Databases

Sequence database configuration requires the following data:

  • A "type" string, used for grouping databases across species and methods (e.g. CDNA_ALL, PEP_ALL).
  • A "label" string, used for display in the interface (e.g. "Ensembl cDNAs", "Ensembl Peptides").

To register databases, type/label pairs must be entered in the BLAST_DATASOURCES configuration section of MULTI.ini. An additional DEFAULT key sets the type of the default database, e.g.:

[BLAST_DATASOURCES]
; Registers blast datasources. Key values are used as labels
; Keys should be registered against methods in species.ini files
DEFAULT         = LATESTGP
CDNA_ALL        = cDNAs
CDNA_ABINITIO   = Ab-initio cDNAs (Genscan/SNAP)
PEP_ALL         = Peptides
PEP_ABINITIO    = Ab-initio Peptides (Genscan/SNAP)
RNA_NC          = Ensembl Non-coding RNA genes
LATESTGP        = Genomic sequence
LATESTGP_MASKED = Genomic sequence (masked)
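
The relationship between the DEFAULT key and the type/label pairs can be sketched as follows. This is a minimal illustration using Python's configparser, not the code Ensembl uses; the function name load_datasources is hypothetical, and the check that DEFAULT names a registered type is an assumption about what a sensible configuration looks like.

```python
import configparser

# A shortened copy of the example section above.
DATASOURCES_INI = """
[BLAST_DATASOURCES]
; Registers blast datasources. Key values are used as labels
DEFAULT         = LATESTGP
CDNA_ALL        = cDNAs
PEP_ALL         = Peptides
LATESTGP        = Genomic sequence
"""


def load_datasources(text):
    """Return (default_type, {type: label}) from a BLAST_DATASOURCES section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keys are upper case in the .ini files
    parser.read_string(text)
    labels = dict(parser["BLAST_DATASOURCES"])
    default_type = labels.pop("DEFAULT", None)  # DEFAULT is not a datasource
    return default_type, labels


default_type, labels = load_datasources(DATASOURCES_INI)
# Assumed sanity check: the default should be one of the registered types.
if default_type not in labels:
    raise ValueError(f"DEFAULT refers to unregistered type {default_type!r}")
print(default_type, "->", labels[default_type])
```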

Registering Search Methods

Within the context of BlastView, a search method is an algorithm/executable that takes a query sequence of a particular type (DNA/peptide) and a sequence database of a particular type (DNA/peptide), and computes some measure of sequence similarity between the two. This means that BLASTN, TBLASTX, BLASTP, BLAT etc. are entirely separate at the configuration level.

Before use with BlastView, code must be written to "wrap" each executable in a Perl module with a "runnable" interface. These wrapper modules hide the differences in calling and report handling between methods from the BlastView interface code. Coding of these wrappers for different methods/system configurations is covered in the BlastView tech document.

The key attributes of BlastView methods are therefore:

  • A "type" string used for grouping methods across species, also used as a human-readable label.
  • A "module" arrayref used to identify the appropriate wrapper class(es).

To register methods, type/module pairs must be entered in the ENSEMBL_BLAST_METHODS configuration section of MULTI.ini, e.g.:

BLASTN   = [blast dna dna ensembl_wublastn]  ; alternatives: ncbiblastn,  ensembl_wublastn, sge_wublastn
BLASTX   = [blast dna peptide ensembl_wublastx]  ; alternatives: ncbiblastx,  ensembl_wublastx, sge_wublastx
BLASTP   = [blast peptide peptide ensembl_wublastp]  ; alternatives: ncbiblastp,  ensembl_wublastp, sge_wublastp
TBLASTN  = [blast peptide dna ensembl_wutblastn] ; alternatives: ncbitblastn, ensembl_wutblastn, sge_wutblastn
TBLASTX  = [blast dna dna ensembl_wutblastx] ; alternatives: ncbitblastx, ensembl_wutblastx, sge_wutblastx
BLAT     = [blat dna dna blat_gfclient]

For example, the ensembl_wublastn wrapper contains the logic to run a wu-blastn search using the Compaq BSUB job submission system as used for the Ensembl Blast cluster. Conversely, the blat_gfclient wrapper runs BLAT searches over TCP/IP using a client-server model. Further wrappers are being developed to, for example, run blast searches on the same machine as the web server.

Unfortunately, it is not currently possible to override the blast method wrappers on a per-species basis. This is a known limitation of the system, and may be addressed in the future.
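
To make the shape of a method entry concrete, the following sketch splits one line into its type string and the elements of its bracketed "module" list, discarding the trailing ; comment. It is an illustration only: the function name parse_method_entry is hypothetical, and the elements are returned as opaque tokens because this document does not define what each position means.

```python
def parse_method_entry(line):
    """Split 'TYPE = [a b c] ; comment' into (type, [a, b, c])."""
    body = line.split(";", 1)[0]  # drop the trailing comment, if any
    name, _, value = body.partition("=")
    value = value.strip()
    if not (value.startswith("[") and value.endswith("]")):
        raise ValueError(f"expected a bracketed list in {line!r}")
    return name.strip(), value[1:-1].split()


method_type, module = parse_method_entry(
    "BLASTX   = [blast dna peptide ensembl_wublastx]  ; alternatives: ncbiblastx"
)
print(method_type, module)
```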

In addition to the ENSEMBL_BLAST_METHODS section, there are several attributes in the general section of DEFAULTS.ini that also affect method configuration. These are:

# Path to binaries on local machine 
ENSEMBL_BINARIES_PATH           = /usr/local/bin 
# Path to binaries on remote machine
ENSEMBL_BLAST_BIN_PATH          = /usr/remote/bin 
# Path to blast databases
ENSEMBL_BLAST_DATA_PATH         = /usr/remote/data 
# Path to blast filter directory
ENSEMBL_BLAST_FILTER            = /usr/remote/blast/filter
# Path to blast matrix directory 
ENSEMBL_BLAST_MATRIX            = /usr/remote/blast/matrix
# Path to RepeatMasker executable 
ENSEMBL_REPEATMASKER            = /usr/remote/RepeatMasker 
# Path to BLAST servers
ENSEMBL_BLAST_SERVERS           = [ blast-01 blast-02 ]

If you have created your blast indices with standardised file names, there is no need to configure these individually. However, if you are using BLAT, you do need to configure it in the corresponding <Genus_species>.ini file.

Registering non-standard indices

The standard file name format is Genus_species.ASSEMBLY.<release number>.<type>.<set>.fa, e.g.:

Homo_sapiens.GRCh37.65.dna.seqlevel.fa

If your files are not named in this format, each of the method types registered in MULTI.ini (e.g. BLASTN, TBLASTX) must have a corresponding section named [<METHOD>_DATASOURCES] in the <Genus_species>.ini file, e.g. [BLASTN_DATASOURCES].

The [<METHOD>_DATASOURCES] section contains:

  1. A DATASOURCE_TYPE = dna or DATASOURCE_TYPE = peptide key=value pair to specify the query type (dna/peptide) that the search method expects as input (see example below).
  2. <DATABASE> = <LOCATOR> (KEY = value) pairs which are links to one or more sequence databases.
    • <DATABASE> is one of the database types registered in DEFAULTS.ini (e.g. CDNA_ALL, LATESTGP, PEP_KNOWN)
    • <LOCATOR> refers to the filesystem or TCP/IP location of the database. For the blast databases, the <LOCATOR> can either be the full name of the file (see example below), or the file name can be replaced with %_, e.g.
      LATESTGP      = %_

      In the latter case, the file name is generated automatically at server start-up, following the pattern Genus_species.Assembly.Release.sequencetype.subset.fa

Example:

[BLASTN_DATASOURCES] 
DATASOURCE_TYPE = dna 
LATESTGP      = ProjectX.dna.fa  
CDNA_ALL      = Project.cdna_all.fa 
CDNA_ABINITIO = Project.cdna_abinitio.fa
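
The difference between an explicit locator and the %_ placeholder can be sketched as below. This is an assumption-laden illustration rather than the server's actual start-up logic: both function names are hypothetical, and the caller supplies the sequence type and subset directly because this document does not describe how a database type such as LATESTGP maps onto them.

```python
def standard_index_name(species, assembly, release, seq_type, subset):
    """Build a file name in the standard format described above."""
    return f"{species}.{assembly}.{release}.{seq_type}.{subset}.fa"


def resolve_locator(locator, default_name):
    """Return the configured file name, or default_name when the locator is %_."""
    locator = locator.strip()
    return default_name if locator == "%_" else locator


default_name = standard_index_name("Homo_sapiens", "GRCh37", 65, "dna", "seqlevel")
print(resolve_locator("%_", default_name))               # placeholder: standard name
print(resolve_locator("ProjectX.dna.fa", default_name))  # explicit file name
```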

Configuring the ENSEMBL BLAST Database

To improve interface responsiveness, search reports are parsed once, and the results stored in a database. The location of this database is configured in the MULTI.ini file. Configuration is similar to that of all other Ensembl databases, with the name of the database being ENSEMBL_BLAST. The main difference between ENSEMBL_BLAST and other Ensembl databases, however, is that ENSEMBL_BLAST is a read/write database, so the configured database user must have write permission. For example:

[databases] 
DATABASE_BLAST = ensembl_blast 
[DATABASE_BLAST] 
HOST = localhost 
PORT = 3306 
USER = admin_user 
PASS = secret
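
As an illustration of how the two sections fit together, the sketch below reads the example above and collects the connection settings into one dictionary. It uses Python's configparser purely for demonstration; the function name blast_db_settings and the dictionary keys are hypothetical and are not part of the Ensembl code.

```python
import configparser

# The example configuration from above.
DB_INI = """
[databases]
DATABASE_BLAST = ensembl_blast

[DATABASE_BLAST]
HOST = localhost
PORT = 3306
USER = admin_user
PASS = secret
"""


def blast_db_settings(text):
    """Combine the [databases] entry with its [DATABASE_BLAST] details."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keys are upper case in the .ini files
    parser.read_string(text)
    details = parser["DATABASE_BLAST"]
    return {
        "database": parser["databases"]["DATABASE_BLAST"],
        "host": details["HOST"],
        "port": int(details["PORT"]),
        "user": details["USER"],  # must have write permission
        "password": details["PASS"],
    }


settings = blast_db_settings(DB_INI)
print(settings["database"], settings["host"], settings["port"])
```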

The database schema of the ENSEMBL_BLAST database is distributed within the Perl code, rather than available by FTP download. First, create an empty database (note that there is no space between -p and the password), e.g.:

$ mysql -u admin_user -psecret -e "create database ensembl_blast"

Next, run the script that creates the database tables automatically, e.g.:

utils/utils/blast_database.pl

The correct execution of this script can be checked as follows:

$ mysqldump -u admin_user -psecret ensembl_blast

The above should result in output similar to the following. Note that the blast_result, blast_hit and blast_hsp tables are timestamped. Re-running the blast_database.pl script rotates these tables: old results, hits and HSPs remain available, but new searches are stored in the new tables. It is therefore simple to maintain the DATABASE_BLAST database by dropping old tables. See the utils/blast_cleaner.pl script for an example of how this is done.

--
-- Table structure for table "blast_table_log"
--

CREATE TABLE blast_table_log (
  table_id int(10) unsigned NOT NULL auto_increment,
  table_name varchar(32) default NULL,
  table_type enum("TICKET","RESULT","HIT","HSP") default NULL,
  table_status enum("CURRENT","FILLED","DELETED") default NULL,
  use_date date default NULL,
  create_time datetime default NULL,
  delete_time datetime default NULL,
  num_objects int(10) default NULL,
  PRIMARY KEY (table_id),
  KEY table_name (table_name),
  KEY table_type (table_type),
  KEY use_date (use_date),
  KEY table_status (table_status)
) TYPE=MyISAM;

--
-- Table structure for table "blast_ticket"
--

CREATE TABLE blast_ticket (
  ticket_id int(10) unsigned NOT NULL auto_increment,
  create_time datetime NOT NULL default "0000-00-00 00:00:00",
  update_time datetime NOT NULL default "0000-00-00 00:00:00",
  ticket varchar(32) NOT NULL default "",
  object longblob,
  PRIMARY KEY (ticket_id),
  UNIQUE KEY ticket (ticket),
  KEY create_time (create_time),
  KEY update_time (update_time)
) TYPE=MyISAM;

--
-- Table structure for table "blast_result20030821"
--

CREATE TABLE blast_result20030821 (
  result_id int(10) unsigned NOT NULL auto_increment,
  ticket varchar(32) default NULL,
  object longblob,
  PRIMARY KEY (result_id),
  KEY ticket (ticket)
) TYPE=MyISAM;

CREATE TABLE blast_hit20030821 (
  hit_id int(10) unsigned NOT NULL auto_increment,
  ticket varchar(32) default NULL,
  object longblob,
  PRIMARY KEY (hit_id),
  KEY ticket (ticket)
) TYPE=MyISAM;

--
-- Table structure for table "blast_hsp20030821"
--

CREATE TABLE blast_hsp20030821 (
  hsp_id int(10) unsigned NOT NULL auto_increment,
  ticket varchar(32) default NULL,
  object longblob,
  chr_name varchar(32) default NULL,
  chr_start int(10) unsigned default NULL,
  chr_end int(10) unsigned default NULL,
  PRIMARY KEY (hsp_id),
  KEY ticket (ticket)
) TYPE=MyISAM MAX_ROWS=705032704 AVG_ROW_LENGTH=4000;
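
Because the rotated tables carry their creation date in their name, candidates for dropping can be picked out by name alone. The sketch below shows one way to do that, assuming the blast_<type>YYYYMMDD naming seen in the output above. It is not a description of what the blast cleaner script does; the function name tables_older_than is hypothetical, and a real clean-up would probably also need to consult or update blast_table_log.

```python
import re
from datetime import date, datetime

# Assumed naming convention, taken from the example output: blast_hsp20030821
TABLE_NAME = re.compile(r"^blast_(result|hit|hsp)(\d{8})$")


def tables_older_than(table_names, cutoff):
    """Return the timestamped tables whose date is earlier than cutoff."""
    old = []
    for name in table_names:
        match = TABLE_NAME.match(name)
        if match is None:
            continue  # not a rotated table, e.g. blast_ticket
        created = datetime.strptime(match.group(2), "%Y%m%d").date()
        if created < cutoff:
            old.append(name)
    return old


names = [
    "blast_table_log",
    "blast_ticket",
    "blast_result20030821",
    "blast_hit20030821",
    "blast_hsp20030901",
]
for table in tables_older_than(names, date(2003, 8, 31)):
    print("candidate for dropping:", table)
```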

Conclusion

By following the above steps, the BlastView interface should be available for use on the next restart of the Ensembl web server. However, if the configuration .ini files have been changed, the following file (a filesystem cache of the configuration) should be deleted before the server is restarted: $ENSEMBL_ROOT/conf/config.packed
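
Removing the cached configuration can be scripted as part of a restart procedure. The following is a minimal sketch, assuming the cache lives at conf/config.packed under the Ensembl root as stated above; the function name remove_packed_config is hypothetical.

```python
from pathlib import Path


def remove_packed_config(ensembl_root):
    """Delete conf/config.packed under ensembl_root; return True if removed."""
    packed = Path(ensembl_root) / "conf" / "config.packed"
    if packed.is_file():
        packed.unlink()
        return True
    return False  # nothing to delete, e.g. the cache was never built
```

It could be called as remove_packed_config("/usr/local/ensembl"), or with the value of the ENSEMBL_ROOT environment variable, before the web server is restarted.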

For further details about how BlastView works, please see the BlastView technical documentation: