
Variant Effect Predictor Filtering results


The filter_vep script is included alongside the main VEP script. It can be used to filter VEP output files to find important or interesting results.

It operates on standard, tab-delimited or VCF-formatted output (NB: only VCF output produced by VEP, or in the same format, can be used).

Running filter_vep

Run the script as follows:

./vep -i in.vcf -o out.txt -cache -everything
./filter_vep -i out.txt -o out_filtered.txt -filter "[filter_text]"

The script can also read from STDIN and write to STDOUT, and so may be used in a UNIX pipe:

./vep -i in.vcf -o stdout -cache -check_existing | ./filter_vep -filter "not Existing_variation" -o out.txt

The above command removes known variants from your output.
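The same pipe can be driven from a script. The following is a minimal Python sketch, not part of VEP: the helper names (vep_command, filter_command, run_pipeline) are hypothetical, and the executable paths and flags are copied from the example above. Passing the filter text as a single argument avoids any shell quoting problems.

```python
import subprocess


def vep_command(input_vcf):
    """Build the argv for the VEP step shown above, writing to STDOUT."""
    return ["./vep", "-i", input_vcf, "-o", "stdout", "-cache", "-check_existing"]


def filter_command(filter_text, output_file):
    """Build the argv for filter_vep; the filter is one argument, so no quoting is needed."""
    return ["./filter_vep", "-filter", filter_text, "-o", output_file]


def run_pipeline(input_vcf, filter_text, output_file):
    """Connect vep's STDOUT to filter_vep's STDIN, as the UNIX pipe does."""
    vep = subprocess.Popen(vep_command(input_vcf), stdout=subprocess.PIPE)
    flt = subprocess.Popen(filter_command(filter_text, output_file), stdin=vep.stdout)
    # Close our copy of the pipe so vep is signalled if filter_vep exits early.
    vep.stdout.close()
    filter_status = flt.wait()
    vep_status = vep.wait()
    return vep_status, filter_status
```

Calling run_pipeline("in.vcf", "not Existing_variation", "out.txt") would then correspond to the piped command above, assuming both scripts are in the current directory.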


Options

  • --help (alternate: -h) : Print usage message and exit
  • --input_file [file] (alternate: -i) : Specify the input file (i.e. the VEP results file). If no input file is specified, the script will attempt to read from STDIN. Input may be gzipped - to force the script to read a file as gzipped, use --gz
  • --format [format] : Specify input file format (vep or vcf)
  • --output_file [file] (alternate: -o) : Specify the output file to write to. If no output file is specified, the script will write to STDOUT
  • --force_overwrite : Force the script to overwrite the output file if it already exists
  • --filter [filters] (alternate: -f) : Add filter (see below). Multiple --filter flags may be used, and are treated as logical ANDs, i.e. all filters must pass for a line to be printed
  • --list (alternate: -l) : List allowed fields from the input file
  • --count (alternate: -c) : Print only a count of matched lines
  • --only_matched : In VCF files, the CSQ field that contains the consequence data will often contain more than one "block" of consequence data, where each block corresponds to a variant/feature overlap. Using --only_matched will remove blocks that do not pass the filters. By default, the script prints out the entire VCF line if any of the blocks pass the filters.
  • --ontology (alternate: -y) : Use Sequence Ontology to match consequence terms. Use with operator "is" to match against all child terms of your value. e.g. "Consequence is coding_sequence_variant" will match missense_variant, synonymous_variant etc. Requires database connection; defaults to connecting to ensembldb.ensembl.org. Use --host, --port, --user, --password, --version as per vep to change connection parameters.
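The effect of --only_matched on a VCF CSQ value can be illustrated with a short Python sketch. This is not the script's own code: filter_csq is a hypothetical helper, the three field names and both blocks are made-up toy data, and the sketch assumes the usual CSQ layout of comma-separated blocks with "|"-separated fields.

```python
def filter_csq(csq_value, field_names, predicate, only_matched=False):
    """Return the CSQ value that would be kept, or None if no block passes.

    predicate takes a dict mapping field name to value for one block.
    """
    blocks = csq_value.split(",")
    passing = [
        block
        for block in blocks
        if predicate(dict(zip(field_names, block.split("|"))))
    ]
    if not passing:
        return None  # no block passes, so the line is not printed
    # Default: keep every block; with only_matched, keep only passing blocks.
    return ",".join(passing if only_matched else blocks)


# Toy data: two consequence blocks for one variant (illustrative only).
FIELDS = ["Allele", "Consequence", "SYMBOL"]
CSQ = "A|missense_variant|GENE1,A|upstream_gene_variant|GENE2"


def is_missense(block):
    return block["Consequence"] == "missense_variant"


print(filter_csq(CSQ, FIELDS, is_missense))
print(filter_csq(CSQ, FIELDS, is_missense, only_matched=True))
```

The first call prints both blocks because one of them passes; the second prints only the missense block.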

Writing filters

Filter strings consist of three components:

  1. Field : A field name from the VEP results file. This can be any field in the "main" columns of the output, or any in the "Extra" final column. For VCF files, this is any field defined in the "##INFO=<ID=CSQ" header. You can list available fields using --list. Field names are not case sensitive, and you may use the first few characters of a field name if they resolve uniquely to one field name.
  2. Operator : The operator defines the comparison carried out.
  3. Value : The value to which the content of the field is compared. May be prefixed with "#" to represent the value of another field.
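The field-name rules above (case-insensitive, unique prefixes allowed) can be sketched in Python as follows. resolve_field is a hypothetical helper, not the script's implementation. Letting an exact match take priority over prefix matches is an assumption, made because a name such as Feature is itself a prefix of Feature_type.

```python
def resolve_field(name, available_fields):
    """Map a user-supplied field name onto one of the available fields."""
    wanted = name.lower()

    # Assumption: an exact (case-insensitive) match always wins.
    for field in available_fields:
        if field.lower() == wanted:
            return field

    # Otherwise accept a prefix only if it identifies exactly one field.
    candidates = [f for f in available_fields if f.lower().startswith(wanted)]
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        raise KeyError(f"no field matches {name!r}")
    raise ValueError(f"{name!r} is ambiguous: {', '.join(candidates)}")


# Illustrative subset of fields, such as --list might report.
FIELD_NAMES = ["Feature", "Feature_type", "Consequence", "SYMBOL", "SIFT"]
print(resolve_field("cons", FIELD_NAMES))
print(resolve_field("feature", FIELD_NAMES))
```

Here "cons" resolves to Consequence, while "S" would be rejected as ambiguous between SYMBOL and SIFT.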

Examples:

# match entries where Feature (Transcript) is "ENST00000307301"
--filter "Feature is ENST00000307301"

# match entries where Protein_position is less than 10
--filter "Protein_position < 10"

# match entries where Consequence contains "stream" (this will match upstream and downstream)
--filter "Consequence matches stream"

For certain fields you may only be interested in whether a value exists for that field; in this case the operator and value can be left out:

# match entries where the gene symbol is defined
--filter "SYMBOL"

The value component may be another field; to represent this, prefix the name of the field to be used as a value with "#":

# match entries where AFR_AF is greater than EUR_AF
--filter "AFR_AF > #EUR_AF"
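As an illustration of how a "#"-prefixed value refers to another field on the same line, here is a small Python sketch. The helper names are hypothetical, the row is toy data, and treating a missing field on either side as a failed comparison is an assumption.

```python
def resolve_value(value, row):
    """Return the literal value, or the content of another field if prefixed with '#'."""
    if value.startswith("#"):
        return row.get(value[1:])
    return value


def greater_than(row, field, value):
    """Evaluate 'field > value' for one row of results."""
    left = row.get(field)
    right = resolve_value(value, row)
    if left is None or right is None:
        return False  # assumption: a missing field never satisfies the comparison
    return float(left) > float(right)


row = {"AFR_AF": "0.25", "EUR_AF": "0.1"}  # toy values
print(greater_than(row, "AFR_AF", "#EUR_AF"))
print(greater_than(row, "AFR_AF", "0.5"))
```

The first comparison uses the value of EUR_AF from the same row; the second compares against the literal 0.5.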

Filter strings can be linked together by the logical operators "or" and "and", and inverted by prefixing with "not":

# filter for missense variants in CCDS transcripts where the variant falls in a protein domain
--filter "Consequence is missense_variant and CCDS and DOMAINS"

# find variants where the allele frequency is greater than 10% in either AFR or EUR populations
--filter "AFR_AF > 0.1 or EUR_AF > 0.1"

# filter out known variants
--filter "not Existing_variation"

Filter logic may be grouped using parentheses, nested to any depth:

# find variants with AF > 0.1 in AFR or EUR but not EAS or SAS
--filter "(AFR_AF > 0.1 or EUR_AF > 0.1) and (EAS_AF < 0.1 and SAS_AF < 0.1)"

For fields that contain string and number components, the script will try to match the relevant part based on the operator in use. For example, using --sift b in VEP gives strings that look like "tolerated(0.46)". Such a value will match either of the following filters:

# match string part
--filter "SIFT is tolerated"

# match number part
--filter "SIFT < 0.5"
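One way to picture this is to split the value into its two parts and let the operator decide which part to use. The Python sketch below is an illustration only; split_prediction and its regular expression are hypothetical and are not taken from the script.

```python
import re

# Text followed by a number in parentheses, e.g. "tolerated(0.46)".
_PREDICTION = re.compile(r"^(?P<text>[^()]*)\((?P<number>-?\d+(?:\.\d+)?)\)$")


def split_prediction(value):
    """Split 'tolerated(0.46)' into ('tolerated', 0.46).

    Values without a parenthesised number are returned as (value, None).
    """
    found = _PREDICTION.match(value)
    if found is None:
        return value, None
    return found.group("text"), float(found.group("number"))


text, number = split_prediction("tolerated(0.46)")
print(text == "tolerated")  # string part, as used by "SIFT is tolerated"
print(number < 0.5)         # number part, as used by "SIFT < 0.5"
```

Both comparisons succeed for this value, mirroring the two filters above.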

Note that for numeric fields, such as the *AF allele frequency fields, filter_vep does not consider the absence of a value for that field as equivalent to a 0 value. For example, if you wish to find rare variants by finding those where the allele frequency is less than 1% or absent, you should use the following:

--filter "AF < 0.01 or not AF"
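The difference between the two halves of this filter can be seen in a small Python sketch. The function names are hypothetical and the rows are toy data; treating an empty string or "-" as an absent value is an assumption about how a missing field appears once parsed.

```python
def has_value(row, field):
    """True if the field is present and non-empty in this row."""
    return row.get(field) not in (None, "", "-")


def af_below(row, threshold):
    """'AF < threshold': an absent AF is NOT treated as 0, so it fails."""
    return has_value(row, "AF") and float(row["AF"]) < threshold


def is_rare(row, threshold=0.01):
    """'AF < 0.01 or not AF': rare or never observed."""
    return af_below(row, threshold) or not has_value(row, "AF")


for row in ({"AF": "0.005"}, {"AF": "0.2"}, {}):
    print(row, af_below(row, 0.01), is_rare(row))
```

The row with no AF fails the plain "AF < 0.01" test but passes the combined filter.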

For the Consequence field it is possible to use the Sequence Ontology to match terms ontologically; for example, to match all coding consequences (e.g. missense_variant, synonymous_variant):

--ontology --filter "Consequence is coding_sequence_variant"
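Ontological matching amounts to accepting a term if it is the requested term or one of its descendants. The Python sketch below uses a hand-written, two-entry excerpt based only on the example above; the real script looks terms up in a database, and the names TOY_CHILDREN, descendants and consequence_is are hypothetical. Letting the requested term match itself is an assumption.

```python
# Toy excerpt only: the real hierarchy is larger and is read from a database.
TOY_CHILDREN = {
    "coding_sequence_variant": ["missense_variant", "synonymous_variant"],
}


def descendants(term, children):
    """Collect every term below 'term' in the child map."""
    found = set()
    stack = list(children.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found


def consequence_is(observed, wanted, children=TOY_CHILDREN, ontology=False):
    """Evaluate 'Consequence is wanted', optionally matching child terms too."""
    if observed == wanted:
        return True
    return ontology and observed in descendants(wanted, children)


print(consequence_is("missense_variant", "coding_sequence_variant"))
print(consequence_is("missense_variant", "coding_sequence_variant", ontology=True))
```

Without ontology matching the comparison is an exact string test and fails; with it, the child term is accepted.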

Operators

  • is (synonyms: = , eq) : Match exactly
    # get only transcript consequences
    --filter "Feature_type is Transcript"
  • != (synonym: ne) : Does not match exactly
    # filter out tolerated SIFT predictions
    --filter "SIFT != tolerated"
  • match (synonyms: matches , re , regex) : Match string using a regular expression. You may include any regular expression notation, e.g. "\d" for any digit
    # match stop_gained, stop_lost and stop_retained
    --filter "Consequence match stop"
  • < (synonym: lt) : Less than. Note an absent value is not considered to be equivalent to 0.
    # find SIFT scores less than 0.1
    --filter "SIFT < 0.1"
  • > (synonym: gt) : Greater than
    # find variants not in the first exon
    --filter "Exon > 1"
  • <= (synonym: lte) : Less than or equal to. Note an absent value is not considered to be equivalent to 0.
  • >= (synonym: gte) : Greater than or equal to
  • exists (synonyms: ex , defined) : Field is defined - equivalent to using no operator and value
  • in : Find in list or file. Value may be either a comma-separated list or a file containing values on separate lines. Each list item is compared using the "is" operator.
    # find variants in a list of gene names
    --filter "SYMBOL in BRCA1,BRCA2"
    
    # filter using a file of MotifFeatures
    --filter "Feature in /data/files/motifs_list.txt"
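The operators and their synonyms can be summarised as a lookup table. The Python sketch below is a rough model of the list above, not the script's implementation: all names are hypothetical, Python's regular expression syntax stands in for the script's, numeric comparisons use the first number found in the field (an assumption), and the behaviour of "!=" on an absent field is also an assumption. String operators here compare the whole value and do not handle mixed string/number fields.

```python
import operator
import os
import re

SYNONYMS = {
    "=": "is", "eq": "is", "ne": "!=",
    "matches": "match", "re": "match", "regex": "match",
    "lt": "<", "gt": ">", "lte": "<=", "gte": ">=",
    "ex": "exists", "defined": "exists",
}

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def _first_number(text):
    """Return the first number in text as a float, or None if there is none."""
    if text is None:
        return None
    found = _NUMBER.search(str(text))
    return float(found.group()) if found else None


def _numeric(compare):
    def check(field_value, value):
        left, right = _first_number(field_value), _first_number(value)
        if left is None or right is None:
            return False  # an absent value is not treated as 0
        return compare(left, right)
    return check


def _load_list(value):
    """Read values from a file (one per line) or from a comma-separated list."""
    if os.path.isfile(value):
        with open(value) as handle:
            return [line.strip() for line in handle if line.strip()]
    return value.split(",")


OPERATORS = {
    "is": lambda field, value: field == value,
    "!=": lambda field, value: field != value,
    "match": lambda field, value: field is not None and re.search(value, field) is not None,
    "<": _numeric(operator.lt),
    ">": _numeric(operator.gt),
    "<=": _numeric(operator.le),
    ">=": _numeric(operator.ge),
    "exists": lambda field, value: field is not None,
    "in": lambda field, value: field in _load_list(value),  # each item compared with "is"
}


def apply_operator(name, field_value, value=None):
    """Look up an operator by name or synonym and apply it."""
    return OPERATORS[SYNONYMS.get(name, name)](field_value, value)


print(apply_operator("matches", "stop_gained", "stop"))
print(apply_operator("in", "BRCA1", "BRCA1,BRCA2"))
```

Keeping synonyms in a separate table means each comparison is defined once, however many names it has.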