Showing docs for version 3.6-0 | The latest version is 4.1.4.0


SelectVariants

Select a subset of variants from a larger callset

Category Variant Manipulation Tools

Traversal LocusWalker

PartitionBy LOCUS


Overview

Often, a VCF containing many samples and/or variants needs to be subset to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls, extracting variant or non-variant loci that meet certain requirements, or displaying just a few samples in a browser like IGV). SelectVariants can be used for this purpose.

There are many different options for selecting subsets of variants from a larger callset:

  • Extract one or more samples from a callset based on either a complete sample name or a pattern match.
  • Specify criteria for inclusion that place thresholds on annotation values, e.g. "DP > 1000" (depth of coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25). These criteria are written as "JEXL expressions", which are documented in the article about using JEXL expressions.
  • Provide concordance or discordance tracks in order to include or exclude variants that are also present in other given callsets.
  • Select variants based on criteria like their type (e.g. INDELs only), evidence of mendelian violation, filtering status, allelicity, and so on.

There are also several options for recording the original values of certain annotations that are recalculated when subsetting the callset, trimming alleles, and so on.

Input

A variant call set from which to select a subset.

Output

A new VCF file containing the selected subset of variants.

Usage examples

Select two samples out of a VCF with many samples

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V input.vcf \
   -o output.vcf \
   -sn SAMPLE_A_PARC \
   -sn SAMPLE_B_ACTG
 

Select two samples and any sample that matches a regular expression

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V input.vcf \
   -o output.vcf \
   -sn SAMPLE_1_PARC \
   -sn SAMPLE_1_ACTG \
   -se 'SAMPLE.+PARC'
 

Exclude two samples and any sample that matches a regular expression:

 java -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -xl_sn SAMPLE_1_PARC \
   -xl_sn SAMPLE_1_ACTG \
   -xl_se 'SAMPLE.+PARC'
 

Select any sample that matches a regular expression and sites where the QD annotation is more than 10:

 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V input.vcf \
   -o output.vcf \
   -se 'SAMPLE.+PARC' \
   -select "QD > 10.0"
 

Select any sample that matches a regular expression and sites where the QD annotation is not more than 10 (-invertSelect inverts the -select criteria):

 java  -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -se 'SAMPLE.+PARC' \
   -select "QD > 10.0" \
   -invertSelect
 

Select a sample and exclude non-variant loci and filtered loci (trim remaining alleles by default):

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V input.vcf \
   -o output.vcf \
   -sn SAMPLE_1_ACTG \
   -env \
   -ef
 

Select a sample, subset remaining alleles, but don't trim:

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V input.vcf \
   -o output.vcf \
   -sn SAMPLE_1_ACTG \
   -env \
   -noTrim

Select a sample and restrict the output vcf to a set of intervals:

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V input.vcf \
   -o output.vcf \
   -L /path/to/my.interval_list \
   -sn SAMPLE_1_ACTG
 

Select all calls missed in my vcf, but present in HapMap (useful to take a look at why these variants weren't called in my dataset):

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V hapmap.vcf \
   --discordance myCalls.vcf \
   -o output.vcf \
   -sn mySample
 

Select all calls made by both myCalls and theirCalls (useful to take a look at what is consistent between two callers):

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V myCalls.vcf \
   --concordance theirCalls.vcf \
   -o output.vcf \
   -sn mySample
 

Generating a VCF of all the variants that are mendelian violations. The optional argument '-mvq' restricts the selection to sites where every trio member has a genotype quality (GQ) of at least 50:

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V input.vcf \
   -ped family.ped \
   -mv -mvq 50 \
   -o violations.vcf
 

Generating a VCF of all the variants that are not mendelian violations. Here '-invMv' inverts the '-mv' selection, and '-mvq 50' sets the genotype quality (GQ) threshold applied to each trio member:

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V input.vcf \
   -ped family.ped \
   -mv -mvq 50 -invMv \
   -o violations.vcf
 

Create a set with 50% of the total number of variants in the variant VCF:

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V input.vcf \
   -o output.vcf \
   -fraction 0.5
 

Select only indels between 2 and 5 bases long from a VCF:

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V input.vcf \
   -o output.vcf \
   -selectType INDEL \
   --minIndelSize 2 \
   --maxIndelSize 5
 

Exclude indels from a VCF:

 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   --selectTypeToExclude INDEL
 

Select only multi-allelic SNPs and MNPs from a VCF (i.e. SNPs with more than one allele listed in the ALT column):

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V input.vcf \
   -o output.vcf \
   -selectType SNP -selectType MNP \
   -restrictAllelesTo MULTIALLELIC
 

Select IDs in fileKeep and exclude IDs in fileExclude:

 java -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -IDs fileKeep \
   -xlIDs fileExclude
 

Select sites where between 2 and 5 samples, and between 10 and 60 percent of the samples, have genotypes that are filtered:

 java -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   --maxFilteredGenotypes 5 \
   --minFilteredGenotypes 2 \
   --maxFractionFilteredGenotypes 0.60 \
   --minFractionFilteredGenotypes 0.10
 

Set filtered genotypes to no-call (./.):

 java -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   --setFilteredGtToNocall
 

Additional Information

Read filters

These Read Filters are automatically applied to the data by the Engine before processing by SelectVariants.

Parallelism options

This tool can be run in multi-threaded mode using this option.


Command-line Arguments

Engine arguments

All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument allows you to apply certain read filters to exclude some of the data from the analysis.

SelectVariants specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s) | Default value | Summary

Required Inputs
--variant, -V | NA | Input VCF file

Optional Inputs
--concordance, -conc | none | Output variants also called in this comparison track
--discordance, -disc | none | Output variants not called in this comparison track
--exclude_sample_expressions, -xl_se | [] | List of sample expressions to exclude
--exclude_sample_file, -xl_sf | [] | List of samples to exclude
--sample_file, -sf | NA | File containing a list of samples to include

Optional Outputs
--out, -o | stdout | File to which variants should be written

Optional Parameters
--exclude_sample_name, -xl_sn | [] | Exclude genotypes from this sample
--excludeIDs, -xlIDs | NA | List of variant IDs to exclude
--keepIDs, -IDs | NA | List of variant IDs to select
--maxFilteredGenotypes | 2147483647 | Maximum number of samples filtered at the genotype level
--maxFractionFilteredGenotypes | 1.0 | Maximum fraction of samples filtered at the genotype level
--maxIndelSize | 2147483647 | Maximum size of indels to include
--maxNOCALLfraction | 1.0 | Maximum fraction of samples with no-call genotypes
--maxNOCALLnumber | 2147483647 | Maximum number of samples with no-call genotypes
--mendelianViolationQualThreshold, -mvq | 0.0 | Minimum GQ score for each trio member to accept a site as a violation
--minFilteredGenotypes | 0 | Minimum number of samples filtered at the genotype level
--minFractionFilteredGenotypes | 0.0 | Minimum fraction of samples filtered at the genotype level
--minIndelSize | 0 | Minimum size of indels to include
--remove_fraction_genotypes, -fractionGenotypes | 0.0 | Select a fraction of genotypes at random from the input and set them to no-call
--restrictAllelesTo | ALL | Select only variants of a particular allelicity
--sample_expressions, -se | NA | Regular expression to select multiple samples
--sample_name, -sn | [] | Include genotypes from this sample
--select_random_fraction, -fraction | 0.0 | Select a fraction of variants at random from the input
--selectexpressions, -select | [] | One or more criteria to use when selecting the data
--selectTypeToExclude, -xlSelectType | [] | Do not select certain type of variants from the input file
--selectTypeToInclude, -selectType | [] | Select only a certain type of variants from the input file

Optional Flags
--excludeFiltered, -ef | false | Don't include filtered sites
--excludeNonVariants, -env | false | Don't include non-variant sites
--forceValidOutput | false | Forces output VCF to be compliant to up-to-date version
--invertMendelianViolation, -invMv | false | Output non-mendelian violation sites only
--invertselect, -invertSelect | false | Invert the selection criteria for -select
--keepOriginalAC | false | Store the original AC, AF, and AN values after subsetting
--keepOriginalDP | false | Store the original DP value after subsetting
--mendelianViolation, -mv | false | Output mendelian violation sites only
--preserveAlleles, -noTrim | false | Preserve original alleles, do not trim
--removeUnusedAlternates, -trimAlternates | false | Remove alternate alleles not present in any genotypes
--setFilteredGtToNocall | false | Set filtered genotypes to no-call

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Engine arguments above.


--concordance / -conc

Output variants also called in this comparison track
A site is considered concordant if (1) we are not looking for specific samples and there is a variant called in both the variant and concordance tracks, or (2) every sample present in the variant track is present in the concordance track and they have the same genotype call.

This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3

RodBinding[VariantContext]  none


--discordance / -disc

Output variants not called in this comparison track
A site is considered discordant if there exists some sample in the variant track that has a non-reference genotype and either the site isn't present in this track, the sample isn't present in this track, or the sample is called reference in this track.

This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3

RodBinding[VariantContext]  none


--exclude_sample_expressions / -xl_se

List of sample expressions to exclude
Using a regular expression allows you to match multiple sample names that have that pattern in common. Note that sample exclusion takes precedence over inclusion, so that if a sample is in both lists it will be excluded. This argument can be specified multiple times in order to use multiple different matching patterns.

Set[String]  []


--exclude_sample_file / -xl_sf

List of samples to exclude
Sample names should be in a plain text file listing one sample name per line. Note that sample exclusion takes precedence over inclusion, so that if a sample is in both lists it will be excluded. This argument can be specified multiple times in order to provide multiple sample list files.

Set[File]  []
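
The precedence rule above (exclusion wins over inclusion) can be previewed outside the tool. The following is a minimal sketch using standard shell utilities, not GATK itself; the file names include.txt, exclude.txt and effective.txt and the sample names are placeholders chosen for illustration.

```shell
# One sample name per line, as expected by -sf and -xl_sf (placeholder names).
printf '%s\n' SAMPLE_A SAMPLE_B SAMPLE_C > include.txt
printf '%s\n' SAMPLE_B > exclude.txt

# Drop every included name that also appears in the exclusion list.
# -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
grep -vxFf exclude.txt include.txt > effective.txt
cat effective.txt
```

Under the stated rule, SAMPLE_B is listed in both files and is therefore dropped; the remaining names are the ones that would be expected in the output when the two files are passed as -sf include.txt and -xl_sf exclude.txt.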


--exclude_sample_name / -xl_sn

Exclude genotypes from this sample
Note that sample exclusion takes precedence over inclusion, so that if a sample is in both lists it will be excluded. This argument can be specified multiple times in order to provide multiple sample names.

Set[String]  []


--excludeFiltered / -ef

Don't include filtered sites
If this flag is enabled, sites that have been marked as filtered (i.e. have anything other than `.` or `PASS` in the FILTER field) will be excluded from the output.

boolean  false


--excludeIDs / -xlIDs

List of variant IDs to exclude
If a file containing a list of IDs is provided to this argument, the tool will not select variants whose ID field is present in this list of IDs. The matching is done by exact string matching. The expected file format is simply plain text with one ID per line.

File  NA


--excludeNonVariants / -env

Don't include non-variant sites

boolean  false


--forceValidOutput / NA

Forces output VCF to be compliant with the up-to-date version
If this argument is provided, the output will be compliant with the version in the header; however, it will also cause the tool to run slower than without the argument. Without the argument, the header will be compliant with the up-to-date version, but the output in the body may not be. If an up-to-date input file is used, the output will also be up-to-date regardless of this argument.

boolean  false


--invertMendelianViolation / -invMv

Output non-mendelian violation sites only
If this flag is enabled, this tool will select only variants that do not correspond to a mendelian violation as determined on the basis of family structure. Requires passing a pedigree file using the engine-level `-ped` argument.

Boolean  false


--invertselect / -invertSelect

Invert the selection criteria for -select
If this flag is enabled, the tool outputs the records that do not match the -select criteria.

boolean  false


--keepIDs / -IDs

List of variant IDs to select
If a file containing a list of IDs is provided to this argument, the tool will only select variants whose ID field is present in this list of IDs. The matching is done by exact string matching. The expected file format is simply plain text with one ID per line.

File  NA
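
The ID list is plain text with one ID per line, so it can be derived from another VCF. The sketch below is one way to do that with awk, assuming a tab-separated VCF whose third column holds the ID; the file known.vcf and the IDs in it are made up for illustration.

```shell
# Tiny stand-in VCF (tab-separated); positions and IDs are placeholders.
printf '#CHROM\tPOS\tID\tREF\tALT\n1\t100\trs0001\tA\tT\n1\t200\t.\tG\tC\n1\t300\trs0002\tC\tG\n' > known.vcf

# Skip header lines and records without an ID ("."); print the ID column.
awk -F'\t' '!/^#/ && $3 != "." { print $3 }' known.vcf > fileKeep
cat fileKeep
```

The resulting file can then be passed as -IDs fileKeep. Because matching is by exact string, any stray whitespace in the list would prevent a match.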


--keepOriginalAC / -keepOriginalAC

Store the original AC, AF, and AN values after subsetting
When subsetting a callset, this tool recalculates the AC, AF, and AN values corresponding to the contents of the subset. If this flag is enabled, the original values of those annotations will be stored in new annotations called AC_Orig, AF_Orig, and AN_Orig.

boolean  false


--keepOriginalDP / -keepOriginalDP

Store the original DP value after subsetting
When subsetting a callset, this tool recalculates the site-level (INFO field) DP value corresponding to the contents of the subset. If this flag is enabled, the original value of the DP annotation will be stored in a new annotation called DP_Orig.

boolean  false


--maxFilteredGenotypes / NA

Maximum number of samples filtered at the genotype level
If this argument is provided, select sites where at most the given number of samples are filtered at the genotype level.

int  2147483647  [ [ -∞  ∞ ] ]


--maxFractionFilteredGenotypes / NA

Maximum fraction of samples filtered at the genotype level
If this argument is provided, select sites where at most the given fraction of samples are filtered at the genotype level.

double  1.0  [ [ -∞  ∞ ] ]


--maxIndelSize / NA

Maximum size of indels to include
If this argument is provided, indels that are larger than the specified size will be excluded.

int  2147483647  [ [ -∞  ∞ ] ]


--maxNOCALLfraction / NA

Maximum fraction of samples with no-call genotypes
If this argument is provided, select sites where at most the given fraction of samples have no-call genotypes.

double  1.0  [ [ -∞  ∞ ] ]


--maxNOCALLnumber / NA

Maximum number of samples with no-call genotypes
If this argument is provided, select sites where at most the given number of samples have no-call genotypes.

int  2147483647  [ [ -∞  ∞ ] ]


--mendelianViolation / -mv

Output mendelian violation sites only
If this flag is enabled, this tool will select only variants that correspond to a mendelian violation as determined on the basis of family structure. Requires passing a pedigree file using the engine-level `-ped` argument.

Boolean  false
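
The pedigree file passed with -ped is not described on this page. As a hedged sketch, assuming the common six-column PED layout (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) and individual IDs that match the sample names in the VCF, a single trio could be written as follows. All names are placeholders.

```shell
# Columns: family ID, individual ID, paternal ID, maternal ID,
# sex (1=male, 2=female), phenotype (1=unaffected, 2=affected).
# Founders use 0 for both parent IDs. printf reuses the format for each row.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  FAM1 FATHER 0 0 1 1 \
  FAM1 MOTHER 0 0 2 1 \
  FAM1 CHILD FATHER MOTHER 1 2 > family.ped
cat family.ped
```

The file would then be supplied as -ped family.ped together with -mv, as in the usage examples above.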


--mendelianViolationQualThreshold / -mvq

Minimum GQ score for each trio member to accept a site as a violation
This argument specifies the genotype quality (GQ) threshold that all members of a trio must have in order for a site to be accepted as a mendelian violation. Note that the `-mv` flag must be set for this argument to have an effect.

double  0.0  [ [ -∞  ∞ ] ]


--minFilteredGenotypes / NA

Minimum number of samples filtered at the genotype level
If this argument is provided, select sites where at least the given number of samples are filtered at the genotype level.

int  0  [ [ -∞  ∞ ] ]


--minFractionFilteredGenotypes / NA

Minimum fraction of samples filtered at the genotype level
If this argument is provided, select sites where at least the given fraction of samples are filtered at the genotype level.

double  0.0  [ [ -∞  ∞ ] ]


--minIndelSize / NA

Minimum size of indels to include
If this argument is provided, indels that are smaller than the specified size will be excluded.

int  0  [ [ -∞  ∞ ] ]


--out / -o

File to which variants should be written

VariantContextWriter  stdout


--preserveAlleles / -noTrim

Preserve original alleles, do not trim
The default behavior of this tool is to remove bases common to all remaining alleles after subsetting operations have been completed, leaving only their minimal representation. If this flag is enabled, the original alleles will be preserved as recorded in the input VCF.

boolean  false


--remove_fraction_genotypes / -fractionGenotypes

Select a fraction of genotypes at random from the input and set them to no-call
The value of this argument should be a number between 0 and 1 specifying the fraction of total genotypes to be randomly selected from the input callset and set to no-call (./.). Note that this is done using a probabilistic function, so the final result is not guaranteed to carry the exact fraction requested. Can be used for large fractions.

double  0.0  [ [ -∞  ∞ ] ]


--removeUnusedAlternates / -trimAlternates

Remove alternate alleles not present in any genotypes
When this flag is enabled, all alternate alleles that are not present in the (output) samples will be removed. Note that this even extends to biallelic SNPs - if the alternate allele is not present in any sample, it will be removed and the record will contain a '.' in the ALT column. Note also that sites-only VCFs, by definition, do not include the alternate allele in any genotype calls.

boolean  false


--restrictAllelesTo / -restrictAllelesTo

Select only variants of a particular allelicity
When this argument is used, we can choose to include only multiallelic or biallelic sites, depending on how many alleles are listed in the ALT column of a VCF. For example, a multiallelic record such as: 1 100 . A AAA,AAAAA will be excluded if `-restrictAllelesTo BIALLELIC` is used, because there are two alternate alleles, whereas a record such as: 1 100 . A T will be included in that case, but would be excluded if `-restrictAllelesTo MULTIALLELIC` is used. Valid options are ALL (default), MULTIALLELIC or BIALLELIC.

The --restrictAllelesTo argument is an enumerated type (NumberAlleleRestriction), which can have one of the following values:

ALL
BIALLELIC
MULTIALLELIC

NumberAlleleRestriction  ALL
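
The allelicity rule described above depends only on how many alleles are listed in the ALT column. The sketch below classifies the two example records from the text with awk; it illustrates the counting rule and is not GATK's implementation. The file names records.vcf and classified.txt are placeholders.

```shell
# The two example records from the text, as tab-separated lines.
printf '1\t100\t.\tA\tAAA,AAAAA\n1\t100\t.\tA\tT\n' > records.vcf

# split() returns the number of comma-separated ALT alleles in column 5.
# Simplification: a non-variant record with ALT "." would be mislabeled here.
awk -F'\t' '!/^#/ {
  n = split($5, alts, ",")
  label = (n > 1) ? "MULTIALLELIC" : "BIALLELIC"
  print $1, $2, $4, $5, label
}' records.vcf > classified.txt
cat classified.txt
```

The first record has two alternate alleles and is labeled MULTIALLELIC; the second has one and is labeled BIALLELIC, matching the behavior described for each setting.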


--sample_expressions / -se

Regular expression to select multiple samples
Using a regular expression allows you to match multiple sample names that have that pattern in common. This argument can be specified multiple times in order to use multiple different matching patterns.

Set[String]  NA
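
Before running the tool, it can help to preview which sample names a pattern would pick up. The sketch below lists the sample columns of a VCF header and filters them with grep -E. It assumes the pattern behaves the same in grep's extended syntax as in the tool's regular expression engine, which holds for simple patterns like the one used in the examples but is not guaranteed in general; whether the tool requires a full or partial match is also not stated here. The header line and sample names are placeholders.

```shell
# Stand-in header line with four placeholder sample columns (tab-separated).
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_1_PARC\tSAMPLE_1_ACTG\tSAMPLE_2_PARC\tCONTROL_1\n' > header.vcf

# Sample names start at column 10 of the #CHROM line.
grep '^#CHROM' header.vcf | cut -f10- | tr '\t' '\n' | grep -E 'SAMPLE.+PARC' > matched.txt
cat matched.txt
```

Here the two names ending in PARC are listed; the same pattern would be passed to the tool as -se 'SAMPLE.+PARC'.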


--sample_file / -sf

File containing a list of samples to include
Sample names should be in a plain text file listing one sample name per line. This argument can be specified multiple times in order to provide multiple sample list files.

Set[File]  NA


--sample_name / -sn

Include genotypes from this sample
This argument can be specified multiple times in order to provide multiple sample names.

Set[String]  []


--select_random_fraction / -fraction

Select a fraction of variants at random from the input
The value of this argument should be a number between 0 and 1 specifying the fraction of total variants to be randomly selected from the input callset. Note that this is done using a probabilistic function, so the final result is not guaranteed to carry the exact fraction requested. Can be used for large fractions.

double  0.0  [ [ -∞  ∞ ] ]
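
To see why a probabilistic selection does not return an exact fraction, consider the following awk sketch, in which each record is kept independently with probability f. This only illustrates the general idea; how GATK draws its random numbers is not described on this page. The file names and the seed value are arbitrary.

```shell
# Ten placeholder records behind one header line.
{
  printf '#CHROM\tPOS\n'
  i=1
  while [ "$i" -le 10 ]; do
    printf '1\t%s\n' "$((i * 100))"
    i=$((i + 1))
  done
} > input_sites.txt

# Always keep header lines; keep each record with probability f.
awk -v f=0.5 -v seed=42 'BEGIN { srand(seed) } /^#/ || rand() < f' input_sites.txt > subset.txt

# Count the records that survived; this need not be exactly 5.
awk '!/^#/ { n++ } END { print n + 0 }' subset.txt
```

With only ten records the kept count can deviate noticeably from half; over a large callset the proportion is expected to settle close to the requested fraction.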


--selectexpressions / -select

One or more criteria to use when selecting the data
See example commands above for detailed usage examples. Note that these expressions are evaluated *after* the specified samples are extracted and the INFO field annotations are updated.

ArrayList[String]  []


--selectTypeToExclude / -xlSelectType

Do not select certain type of variants from the input file
This argument excludes particular kinds of variants out of a list. If left empty, there is no type selection and all variant types are considered for other selection criteria. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times.

List[Type]  []


--selectTypeToInclude / -selectType

Select only a certain type of variants from the input file
This argument selects particular kinds of variants out of a list. If left empty, there is no type selection and all variant types are considered for other selection criteria. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times.

List[Type]  []


--setFilteredGtToNocall / NA

Set filtered genotypes to no-call
If this argument is provided, set filtered genotypes to no-call (./.).

boolean  false


--variant / -V

Input VCF file
Variants from this VCF file are used by this tool as input. The file must at least contain the standard VCF header lines, but can be empty (i.e., no variants are contained in the file).

This argument supports reference-ordered data (ROD) files in the following formats: BCF2, VCF, VCF3

RodBinding[VariantContext]  NA  (required)



GATK version 3.6-0-g89b7209 built at 2017/02/09 12:52:48.