Variant Effect Predictor Filtering results


The filter_vep.pl script is included alongside the main VEP script. It can be used to filter VEP output files down to the important or interesting results.

It operates on either standard or VCF-formatted output (NB: only VCF output produced by the VEP, or in the same format, can be used).

Running the script

Run the script as follows:

perl variant_effect_predictor.pl -i in.vcf -o out.txt -cache -everything
perl filter_vep.pl -i out.txt -o out_filtered.txt -filter "[filter_text]"

The script can also read from STDIN and write to STDOUT, and so may be used in a UNIX pipe:

perl variant_effect_predictor.pl -i in.vcf -o stdout -cache -check_existing | perl filter_vep.pl -filter "not Existing_variation" -o out.txt

The above command removes known variants from your output.


Options

Flag | Alternate | Description
--help | -h | Print usage message and exit
--input_file [file] | -i | Specify the input file (i.e. the VEP results file). If no input file is specified, the script will attempt to read from STDIN. Input may be gzipped - to force the script to read a file as gzipped, use --gz
--format [format] | | Specify input file format (vep or vcf)
--output_file [file] | -o | Specify the output file to write to. If no output file is specified, the script will write to STDOUT
--force_overwrite | | Force the script to overwrite the output file if it already exists
--filter [filters] | -f | Add filter (see below). Multiple --filter flags may be used, and are treated as logical ANDs, i.e. all filters must pass for a line to be printed
--list | -l | List allowed fields from the input file
--count | -c | Print only a count of matched lines
--only_matched | | In VCF files, the CSQ field that contains the consequence data will often contain more than one "block" of consequence data, where each block corresponds to a variant/feature overlap. Using --only_matched will remove blocks that do not pass the filters. By default, the script prints out the entire VCF line if any of the blocks pass the filters.
--ontology | -y | Use Sequence Ontology to match consequence terms. Use with operator "is" to match against all child terms of your value. e.g. "Consequence is coding_sequence_variant" will match missense_variant, synonymous_variant etc. Requires database connection; defaults to connecting to ensembldb.ensembl.org. Use --host, --port, --user, --password, --version as per variant_effect_predictor.pl to change connection parameters.
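
The difference between the default VCF behaviour and --only_matched, together with the AND semantics of multiple --filter flags, can be illustrated with a minimal Python sketch. This is not the script's implementation. It assumes the usual VEP VCF layout in which CSQ blocks are comma-separated and the fields inside a block are pipe-separated; the helper filter_csq, the field list and the transcript IDs are hypothetical.

```python
def filter_csq(csq_value, field_names, predicates, only_matched=False):
    """Return the CSQ string to print, or None if no block passes.

    predicates is a list of functions taking a dict of field -> value.
    All of them must pass, mirroring multiple --filter flags being ANDed.
    """
    blocks = csq_value.split(",")
    passing = []
    for block in blocks:
        record = dict(zip(field_names, block.split("|")))
        if all(pred(record) for pred in predicates):
            passing.append(block)
    if not passing:
        return None  # the whole line is dropped
    # Default: keep every block if any block passes.
    # With only_matched: keep only the blocks that passed.
    return ",".join(passing if only_matched else blocks)


# Hypothetical field order and data, for illustration only.
fields = ["Allele", "Consequence", "Feature_type", "Feature"]
csq = ("A|missense_variant|Transcript|ENST0001,"
       "A|upstream_gene_variant|Transcript|ENST0002")


def is_missense(record):
    return record["Consequence"] == "missense_variant"


def is_transcript(record):
    return record["Feature_type"] == "Transcript"


print(filter_csq(csq, fields, [is_missense, is_transcript]))
print(filter_csq(csq, fields, [is_missense, is_transcript], only_matched=True))
```

The first call keeps both blocks because one of them passes; the second keeps only the missense block.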

Writing filters

Filter strings consist of three components:

  1. Field : A field name from the VEP results file. This can be any field in the "main" columns of the output, or any in the "Extra" final column. For VCF files, this is any field defined in the "##INFO=<ID=CSQ" header. You can list the available fields using --list.
  2. Operator : The operator defines the comparison carried out.
  3. Value : The value to which the content of the field is compared.

Examples:

# match entries where Feature (Transcript) is "ENST00000307301"
--filter "Feature is ENST00000307301"

# match entries where Protein_position is less than 10
--filter "Protein_position < 10"

# match entries where Consequence contains "stream" (this will match upstream and downstream)
--filter "Consequence matches stream"

For certain fields you may only be interested in whether the field is defined; in this case the operator and value can be left out:

# match entries where the gene symbol is defined
--filter "SYMBOL"
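
The simple filter forms shown so far can be sketched in Python as a split into field, operator and value. This is only an illustration under stated assumptions: the function parse_filter is hypothetical, it treats a bare field name as an "is defined" check, and it does not handle the logical operators described next.

```python
def parse_filter(text):
    """Split a simple filter string into (field, operator, value).

    A bare field name is treated as a check that the field is defined.
    """
    parts = text.split(None, 2)  # split on whitespace, at most 3 pieces
    field = parts[0]
    operator = parts[1] if len(parts) > 1 else "exists"
    value = parts[2] if len(parts) > 2 else None
    return field, operator, value


for example in ("Feature is ENST00000307301",
                "Protein_position < 10",
                "Consequence matches stream",
                "SYMBOL"):
    print(parse_filter(example))
```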

Filter strings can also be linked together by the logical operators "or" and "and", and inverted by prefixing with "not":

# filter for missense variants in CCDS transcripts where the variant falls in a protein domain
--filter "Consequence is missense_variant and CCDS and DOMAINS"

# find variants where the MAF is greater than 10% in either AFR or ASN populations
--filter "AFR_MAF > 0.1 or ASN_MAF > 0.1"

# filter out known variants
--filter "not Existing_variation"
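
As a rough illustration of what these combined filters test, the three examples above are written out below as plain Python boolean expressions over one record. The record contents and the helpers defined and greater are hypothetical, the frequency values are simplified to bare numbers, and treating "-" or an empty string as "not defined" is an assumption rather than documented behaviour.

```python
def defined(record, field):
    # Assumption: a missing key, an empty string and "-" all mean "no value".
    return record.get(field) not in (None, "", "-")


def greater(record, field, threshold):
    # Numeric comparison that fails quietly when the field is absent or not a number.
    try:
        return float(record[field]) > threshold
    except (KeyError, ValueError):
        return False


# Hypothetical record, for illustration only.
record = {
    "Consequence": "missense_variant",
    "CCDS": "CCDS0000.1",
    "DOMAINS": "-",
    "AFR_MAF": "0.02",
    "ASN_MAF": "0.15",
    "Existing_variation": "rs0000",
}

# "Consequence is missense_variant and CCDS and DOMAINS"
missense_in_domain = (record["Consequence"] == "missense_variant"
                      and defined(record, "CCDS")
                      and defined(record, "DOMAINS"))

# "AFR_MAF > 0.1 or ASN_MAF > 0.1"
common = greater(record, "AFR_MAF", 0.1) or greater(record, "ASN_MAF", 0.1)

# "not Existing_variation"
novel = not defined(record, "Existing_variation")

print(missense_in_domain, common, novel)
```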

For fields that contain both string and number components, the script will try to match the relevant part based on the operator in use. For example, running the VEP with --sift b gives strings that look like "tolerated(0.46)". This will give a match to either of the following filters:

# match string part
--filter "SIFT is tolerated"

# match number part
--filter "SIFT < 0.5"
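
One way to picture this behaviour is the Python sketch below, which splits a value such as "tolerated(0.46)" into its text and number parts and picks the part that suits the operator. It is an illustration only: the regular expression and the helpers split_mixed and matches are assumptions, not the script's own logic, and only two operators are covered.

```python
import re

# Text followed by a number in parentheses, e.g. "tolerated(0.46)".
MIXED = re.compile(r"^(?P<text>[^()]+)\((?P<number>-?\d+(?:\.\d+)?)\)$")


def split_mixed(value):
    """Return (text, number); either part may be None."""
    found = MIXED.match(value)
    if found:
        return found.group("text"), float(found.group("number"))
    try:
        return None, float(value)
    except ValueError:
        return value, None


def matches(value, operator, target):
    text, number = split_mixed(value)
    if operator == "is":
        # String operator: compare the text part if there is one.
        return (text if text is not None else value) == target
    if operator == "<":
        # Numeric operator: compare the number part.
        return number is not None and number < float(target)
    raise ValueError("operator not covered by this sketch: " + operator)


print(matches("tolerated(0.46)", "is", "tolerated"))
print(matches("tolerated(0.46)", "<", "0.5"))
```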

For the Consequence field it is possible to use the Sequence Ontology to match terms ontologically; for example, to match all coding consequences (e.g. missense_variant, synonymous_variant):

--ontology --filter "Consequence is coding_sequence_variant"

Operators

  • is (synonyms: = , eq) : Match exactly
    # get only transcript consequences
    --filter "Feature_type is Transcript"
  • != (synonym: ne) : Does not match exactly
    # filter out tolerated SIFT predictions
    --filter "SIFT != tolerated"
  • match (synonyms: matches , re , regex) : Match string using regular expression. You may include any regular expression notation, e.g. "\d" for any numerical character
    # match stop_gained, stop_lost and stop_retained
    --filter "Consequence match stop"
  • < (synonym: lt) : Less than
    # find SIFT scores less than 0.1
    --filter "SIFT < 0.1"
  • > (synonym: gt) : Greater than
    # find variants not in the first exon
    --filter "Exon > 1"
  • <= (synonym: lte) : Less than or equal to
  • >= (synonym: gte) : Greater than or equal to
  • exists (synonyms: ex , defined) : Field is defined - equivalent to using no operator and value
  • in : Find in list or file. Value may be either a comma-separated list or a file containing values on separate lines. Each list item is compared using the "is" operator.
    # find variants in a list of gene names
    --filter "SYMBOL in BRCA1,BRCA2"
    
    # filter using a file of MotifFeatures
    --filter "Feature in /data/files/motifs_list.txt"
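
The operator list above, including the synonyms and the list-or-file behaviour of "in", can be summarised in a short Python sketch. It is a hedged illustration rather than the script's implementation: the names SYNONYMS, load_values and compare are hypothetical, numeric operators assume the field holds a plain number, and details such as case sensitivity are not taken from the documentation.

```python
import operator
import os
import re

# Map each documented synonym onto one canonical operator name.
SYNONYMS = {
    "=": "is", "eq": "is",
    "ne": "!=",
    "matches": "match", "re": "match", "regex": "match",
    "lt": "<", "gt": ">", "lte": "<=", "gte": ">=",
    "ex": "exists", "defined": "exists",
}

NUMERIC = {"<": operator.lt, ">": operator.gt,
           "<=": operator.le, ">=": operator.ge}


def load_values(value):
    """Read values from a file (one per line) or a comma-separated list."""
    if os.path.isfile(value):
        with open(value) as handle:
            return [line.strip() for line in handle if line.strip()]
    return value.split(",")


def compare(field_value, op, value=None):
    op = SYNONYMS.get(op, op)
    if op == "exists":
        return field_value not in (None, "")
    if field_value is None:
        return False
    if op == "is":
        return field_value == value
    if op == "!=":
        return field_value != value
    if op == "match":
        return re.search(value, field_value) is not None
    if op == "in":
        # Each list item is compared exactly, as with "is".
        return field_value in load_values(value)
    if op in NUMERIC:
        return NUMERIC[op](float(field_value), float(value))
    raise ValueError("unknown operator: " + op)


print(compare("stop_gained", "regex", "stop"))
print(compare("BRCA2", "in", "BRCA1,BRCA2"))
print(compare("0.05", "lt", "0.1"))
```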