EnsemblEnsembl Home

Import VCF script

Introduction

The import_vcf.pl script populates an Ensembl database from a VCF (Variant Call Format) file. A description of the VCF file format can be found on the 1000 Genomes project website. The script can either populate a database from scratch, or add data to an existing database.

The script can be found in the ensembl-variation git repo, in the scripts/import/ sub-directory.

Please note this document assumes a reasonable degree of familiarity with Ensembl and its internal workings. It may not be suitable for some external users. For any help with this topic, please email the ensembl-dev mailing list.

[Back to top]

Running the script

The script is run on the command line as follows:

 perl import_vcf.pl [options] 

where [options] represent a set of flags and options to the script. These can be listed using the flag --help:

 perl import_vcf.pl --help 

Options can also be read from a configuration file using --config.

Basic options

Flag Alternate Description
--help
  Display help message and quit
--input_file
-i
Input file name. If not specified, the script will attempt to read from STDIN. Input file may be compressed with gzip or bgzip.
--registry [registry file]
  Registry file containing DB connection details. See this document for details. Required
--species [species]
  Species to use for import. Must be either the lowercase, underscore-separated latin name, or an alias for the species added in the registry file. Default = "homo_sapiens"
--source [source name]
  Name for source of variants - written to the source table. Required
--population [population]
  Name for population to which all individuals in the file belong. Required (or use --panel)

Advanced options

Flag Alternate Description
--quiet
-q
Suppress status and warning messages
--progress_update [n]
  Update the progress status after each n variants. This also determines how often recovery state is written to disk. To set the recovery state frequency to 1 without overloading your output with progress messages, add --no_progress Default = 100
--no_progress
  Don't show progress messages. Progress messages shown by default
--config [filename]
  Load configuration options from a config file. The config file should consist of whitespace-separated pairs of option names and settings e.g.:
species       mus_musculus
ind_prefix    MYMICE_
registry      ensembl.registry
--tmpdir [dir]
  Directory in which to store temporary files. Only used if writing to compressed_genotype_region.
--tmpfile
  Name for temporary file used by MySQL import. Only used if writing to compressed_genotype_region.
--source [source name]
  Name for source of variants - written to the source table. Required
--source_description [description]
  Optional text for source description field.
--population [population]
  Name for population to which all individuals in the file belong. Required (or use --panel)
--panel [panel file]
  Panel file containing individual population membership. One or more of --population or --panel is required. Frequencies are calculated for each population specified. Individuals may belong to more than one population. Panel file is tab-delimited, with the format
IND1    POP1
IND2    POP1
IND3    POP2
IND4    POP3
Required (or use --population)
--pedigree [pedigree file]
  Pedigree file containing family relationships and individual genders.
--only_existing
  Only write to tables when an existing variant is found. Existing can be a variation with the same name, or a variant with the same location and alleles.
--gmaf [ALL|population]
  Add global allele frequency data to variation table. "--gmaf ALL" uses all individuals in the file; specifying any other population name will use the selected population for the GMAF.
--ind_prefix [prefix]
  Add prefix to individual names.
--pop_prefix [prefix]
  Add prefix to population names.
--var_prefix [prefix]
  Add prefix to constructed variation names.
--create_name
  Always create a new variation name i.e. don't use ID column. Names are constructed from CHROM and POS fields of VCF.
--flank [prefix]
  Size of flanking sequence entry. Default = 200
--gp
  Use GP tag from INFO column to get coords instead of CHROM and POS fields. Script uses CHROM and POS by default
--tables [list]
  Comma-separated list of tables to include when writing to DB. Overwrites default list.
--add_tables [list]
  Comma-separated list of tables to add to default list. Use to add e.g. compressed_genotype_region, transcript_variation.
--skip_tables [list]
  Comma-separated list of tables to exclude when writing to DB. Takes precedence over --tables (i.e. any tables named in --tables and --skip_tables will be skipped).
--fork [n]
  Fork off n simultaneous processes, each dealing with one chromosome from the input file. If the number of chromosomes is fewer than the number of forks, the input file will be scanned up front and the chromosomes sub-divided into regions. Input file must be bgzipped and tabix indexed. 10 processes is usually optimal.
--test [n]
  Run in test mode on first n lines of file. No database writes are done, and any that would be done are output as status messages. Cannot be used with --fork. Not used by default
--recover
  Attempt to recover an incomplete session. See recovery section for more details.
--recover_pos [pos|chr:pos]
  Force recover from chromosomal position. Given as either "chr:pos" or "pos" alone.
--recover_point [md5]
  Force recover from a position in the file given by the md5 hash of the line content (without newline character).
--no_recover
  Disable session recovery - this will result in a slight speed increase while running.
--sql
  Specify SQL file to create tables. Usually found in the ensembl-variation git repo, as sql/tables.sql.
--coord_system [coord system name]
  If the seq_region table is not populated, by default the script will attempt to copy seq_region entries from a Core database specified in the registry file. The seq_region entries from the selected coord_system will be copied and used. Default = "chromosome"
--backup
  Backup all affected tables before import.
--move
  Move all affected tables to backed up names and replace with empty tables.

[Back to top]


Populations and frequencies

If your VCF file contains individual data, these individuals must belong to at least one population. You may specify either one population that encompasses all individual entries in the file (using --population), or a panel file that contains individual population relationships (using --panel). If you use both, then individuals will be added to any populations specified in the panel file, as well as the "global" population specified with --population.

Frequencies for alleles and genotypes are calculated and stored in the allele and population_genotype tables respectively. Frequencies are calculated on the fly from the genotypes present in the file. Missing genotypes are not included in the total count from which the frequency is derived. Frequencies are calculated for each population specified independently.

Note that a frequency of 0 for an allele is considered valid and will be entered in the allele table. A frequency of 0 will not be entered for a population_genotype, however, since the requirements on the population_genotype table and API are slightly different.

To avoid calculating the frequencies and populating these tables, you may specify "--skip_tables allele,population_genotype" (or some variant therein to avoid only one). It may be that you have only a small population for which population_genotype frequencies are not relevant. You should always (unless you know what you are doing, or for example are loading on top of an existing DB) populate the allele table, since it is assumed that each variation entry will have at least two corresponding alleles with entries in the allele table.

Populations and individuals are only written once - if a population or individual is found with the same name as the one you have specified, this one is loaded and used, rather than a new one being created.

[Back to top]


Default tables

The script by default populates the "basic" set of tables required for a functioning variation database. For a full description of all tables in the variation schema, see this page. The following tables are populated by default:

  • allele
  • allele_code
  • compressed_genotype_var
  • flanking_sequence
  • genotype_code
  • individual
  • individual_population
  • meta_coord
  • population
  • population_genotype
  • sample
  • source
  • variation
  • variation_feature
  • variation_synonym

You may stop the script populating, for example, the population_genotype table using "--skip_tables population_genotype", but you should always be aware of what the consequences of this are before proceeding through a lengthy import!

There are a set of dependencies built into the script that should prevent you skipping any tables that are crucial for the use of other tables (for example, you may not skip allele_code if you want to populate allele).

To see which tables the script will be writing to, do a dry run using --test and add the --backup flag.

Two tables are skipped by default that may be required. Add them by passing e.g. "--add_tables transcript_variation".

transcript_variation

The transcript_variation table is not populated by default, as populating this table depends on calculating the consequence of variants relative to features in the Core database, which can take a long time. It may also be faster to do this once the VCF import is finished using the standard transcript_variation pipeline.

compressed_genotype_region

The compressed_genotype_region table is used to store genotypes in a per-sample, per-genomic region style. Its contents duplicate that of the compressed_genotype_var table, but it is used in a different way. The table is used for resequencing views, and also for the calculation of LD data on the fly by the API. Currently the designation applies to all samples in the file, so if you have a file that contains individuals you don't want to be included for either function then you should split it and import the two separately.

For any data added to the compressed_genotype_region table, you will need to manually edit the entries in the sample table such that the display field indicates the relevant data type. If you intend the data to be used for LD calculation, you should change the population sample entry so that the display field is "LD". If you intend the data to be used as resequencing data, you should change the individuals' sample entries so that the display field is one of "DEFAULT" (switched on the web display by default) or "DISPLAYABLE" (can be switched on using configure this page). You may also need to spoof read_coverage entries for any strains you have, unless you have genuine read coverage data in the read_coverage table. This is as simple as adding an entry to read_coverage for each strain's sample_id and each chromosome or seq_region, with a start and end covering the region covered by the resequencing (this may be the whole chromosome).

[Back to top]


Importing from scratch

The script can be used to create a variation database from scratch using a VCF file. By adding the transcript_variation table as described above, it is possible to create a website-ready variation DB using this script alone.

SQL

You must create your database yourself using MySQL:

CREATE DATABASE my_var_db;

Then set up you registry file as described here to point the variation DB entry to your newly created database (make sure you write the entry as a Bio::EnsEMBL::Variation::DBSQL::DBAdaptor, as opposed to a Bio::EnsEMBL::DBSQL::DBAdaptor) . If you want to do transcript_variation, you should ensure you also have the relevant core DB for your species set up in the registry file.

The script can create the tables using the table.sql file located in the ensembl-variation module (located in the sql subdirectory); simply add e.g. "--sql ../../table.sql" if running the script from the scripts/import subdirectory. You can also do this manually outside of running the script.

seq_region

The script requires that the seq_region table be populated so that it can retrieve seq_region identifiers for each of the chromosomes in your VCF file. If present, the script will attempt to copy these from the core database. By default, the script assumes that your data are on the "chromosome" coord_system; if they are on a different coord_system, you must specify which using --coord_system.

attrib

In order to calculate variation classes and consequences, the various attrib tables must be populated. This may be done manually using the script create_attrib_sql.pl in ensembl-variation/scripts/misc, or automatically by the VCF import script if --sql is used. If it fails, the script will give a warning but will not fail; it is possible to fix this after import by populating the attrib tables and running the transcript_variation/variation class pipeline.

To be fully website-ready, you should populate transcript_variation either using the script (with --add_tables transcript_variation), or using the transcript_variation/variation class pipeline.

[Back to top]


Importing into an existing database

The script may also be used to add data to an existing database. For example, the Ensembl Variation team uses the script to add genotypes from the 1000 Genomes project into the existing dbSNP-derived human variation database.

Merging variants

The script will attempt to "merge" variants that occur at the same location. Insertions and deletions are not considered for merging. Consider two variants, rs1 and rs2:

chr1    50     rs1     G     A
chr1    50     rs2     G     T

The first variant, rs1, will be added as normal, with entries added to variation, variation_feature etc. The second variant, rs2, will be merged into rs1. This means in practice that no new variation or variation_feature entry will be created for rs2. Instead, the variation_feature entry for rs1 will be updated to include rs2's alleles in its allele_string field; it will change from G/A to G/A/T. rs2 will be added to variation_synonym as a synonym for rs1 (so, for example, searching for rs2 on the web would find the page for rs1 with rs2 listed as a synonym). Any linked database entries for rs2, such as those in the allele table, will be added linked to rs1.

Some more examples of variants that would be merged:

# same position, same alleles
chr1    100     rs1     C     T
chr1    100     rs2     C     T

# multi-bp substitution
chr1    100     rs1     CG     TT
chr1    100     rs2     CG     TC

And some examples of variants that would not be merged:

# same position, one SNP, one deletion
chr1    101     rs1     C     T
chr1    100     rs2     CT    C

Adding to existing variants

Use the --only_existing flag to add data only to existing variants; no new variants will be added by the script.

[Back to top]


Test mode

Test mode can be used to check what database changes will be made by running the script with a certain set of parameters. The script will print a status message detailing each SQL process that it will make, without actually executing the SQL. This can be useful to check the following things:

  • populations are being added correctly
  • frequencies are being calculated correctly
  • the expected set of tables is being written to (use with --backup to see the list)

[Back to top]


Recovering unfinished sessions

The import VCF script can recover crashed or otherwise unfinished sessions. To do this, the script stores a status token in a file every n lines (where n is the parameter set with --progress_update). The token contains a MD5 hash of the contents of the last line that the script processed. The token is stored in $HOME/.import_vcf/ with a filename that is another MD5 hash, this time derived from the composite string of the command-line options passed to the script.

When you set the script to recover with --recover, the script checks for the recovery token (you must pass the same command-line arguments as the session you want to recover, obviously with --recover added), then iterates through the input until it reaches the line whose hash matches the one stored in the recovery token. Since the script will have stopped somewhere between two recovery states, it must take some time here to ensure, for example, that duplicate entries are not added to the allele or genotype tables.

There is a very small time penalty for writing the recovery token every n lines; it may be disabled for a marginal increase in speed using --no_recover.

Manual recovery

It is also possible to force the script to start from a given position in the file using --recover_pos; this does not depend on the script finding the recovery token file. You may also use --recover_point and pass the MD5 of the line from which you wish to recover.

[Back to top]


Forking

The speed of import can be vastly increased by running simultaneous processes. You could do this manually by, for example, running multiple copies of the script on several different VCF files.

Using --fork takes on this work for you, by splitting your input file into chunks and running a separate CPU thread on each chunk. Tabix is used to chunk the file, so you must have the tabix utility in your path to use this. Your input file must also be bgzipped and tabix indexed.

The script first checks which chromosomes are listed in the index using "tabix -l"; if the number of chromosomes is less than the number of forks you have requested (e.g. your VCF may contain data for only one chromosome), the script will then do an initial scan through the file to determine the range of positions in the file so that it can be divided into even chunks. This process will take a few minutes on very large files.

The script will then run each of the forks, with up to the number of forks you specified running simultaneously.

Recovery

Using --fork is still compatible with --recover, as each fork writes its own recovery token file. It can only be used with the default --recover mode; it will not work correctly with --recover_pos or --recover_point.

[Back to top]