Showing tool doc from version 4.1.0.0 | The latest version is 4.1.3.0

PathSeqBuildKmers

Builds set of host reference k-mers

Category Metagenomics


Overview

Produce a set of k-mers from the given host reference. The output file from this tool is required to run the PathSeq pipeline.

The tool works by scanning the reference one position at a time. It takes the k-mer (k-base subsequence) starting at each consecutive position and adds it to a set. By default, the set is stored as a hash table.
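
The scan described above can be sketched in a few lines of Python. This is only an illustration of the sliding-window idea, not the tool's implementation: the function name build_kmer_set is hypothetical, k-mers are kept as plain strings, and details such as masking and the handling of ambiguous bases are ignored.

```python
def build_kmer_set(reference: str, k: int = 31) -> set:
    """Collect the k-mer starting at every consecutive position of `reference`."""
    kmers = set()
    # The last valid start position is len(reference) - k.
    for start in range(len(reference) - k + 1):
        kmers.add(reference[start:start + k])
    return kmers


# Toy example with k = 3 instead of the default 31.
toy_kmers = build_kmer_set("ACGTACG", k=3)
print(sorted(toy_kmers))
```

The k-mer ACG occurs twice in the toy sequence but is stored only once, which is the point of using a set.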

Users also have the option to represent the k-mer set using a Bloom filter by specifying a non-zero value for the --bloom-false-positive-probability parameter. This uses less memory than the default hash set but can also produce false positives. In other words, when asked whether a non-host k-mer exists in the set, it will incorrectly say yes with probability p. The user can specify p so that the probability of incorrectly subtracting a non-host read is negligibly small. For p = 0.0001 and a read length of 151 bases, the probability of PathSeq incorrectly subtracting a non-host read is < 1.5%, but the amount of memory used is reduced 4-fold compared to a hash table. For this reason, Bloom filters are generally recommended.
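
The 1.5% figure can be checked with a short calculation. The sketch below assumes k = 31 (the default), that a read is subtracted as soon as any one of its k-mers is reported as present, and that Bloom filter false positives are independent across k-mers. These assumptions are not stated on this page, and the function name is hypothetical.

```python
def read_false_subtraction_probability(p: float, read_length: int, k: int = 31) -> float:
    """Probability that at least one k-mer of a non-host read is a false positive."""
    kmers_per_read = read_length - k + 1  # 151 - 31 + 1 = 121 for the example above
    return 1.0 - (1.0 - p) ** kmers_per_read


prob = read_false_subtraction_probability(p=0.0001, read_length=151)
print(f"{prob:.4f}")  # about 0.012, below the 1.5% bound quoted above
```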

Note that the file formats used for storing these k-mer data structures are only readable by the PathSeq tools.

Input

  • An indexed host reference in FASTA format

Output

  • A set of the k-mers in the reference

Usage examples

Builds a hash table of every k-mer (k = 31) in the reference. Each k-mer is masked at base index 16 (indices start at 0).

 gatk PathSeqBuildKmers  \
   --reference host_reference.fasta \
   --output host_reference.hss \
   --kmer-mask 16 \
   --kmer-size 31
 

Builds a Bloom filter with a false positive probability of at most 0.001.

 gatk PathSeqBuildKmers  \
   --reference host_reference.fasta \
   --output host_reference.bfi \
   --bloom-false-positive-probability 0.001 \
   --kmer-mask 16 \
   --kmer-size 31
 

Notes

For most references, the Java VM will run out of memory with the default settings. The Java heap size limit should be set to at least 20 times the size of the reference (less if building a Bloom filter). For example, for a 3 GB reference, set the limit to 60 GB by adding --java-options "-Xmx60g" to the command.
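
The rule of thumb above amounts to a simple multiplication. The following sketch computes a heap limit from a reference size and assembles the corresponding command. The helper name and the decimal-gigabyte convention are assumptions made for illustration, and placing --java-options directly after gatk follows the usual GATK wrapper convention rather than anything stated on this page.

```python
def suggested_heap_gb(reference_bytes: int, factor: int = 20) -> int:
    """Heap limit in GB: `factor` times the reference size, rounded up."""
    bytes_per_gb = 10**9  # assumption: decimal gigabytes
    return -(-reference_bytes * factor // bytes_per_gb)  # ceiling division


heap_gb = suggested_heap_gb(3 * 10**9)  # 3 GB reference -> 60
command = [
    "gatk", "--java-options", f"-Xmx{heap_gb}g",
    "PathSeqBuildKmers",
    "--reference", "host_reference.fasta",
    "--output", "host_reference.hss",
]
print(" ".join(command))
```

When building a Bloom filter, a smaller factor may suffice, as noted above; the page does not give a specific value.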

PathSeqBuildKmers specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.

Argument name(s)                        Default value  Summary

Required Arguments
--output, -O                            null           File for k-mer set output. Extension will be automatically added if not present (.hss for hash set or .bfi for Bloom filter)
--reference, -R                         null           Reference FASTA file path on local disk

Optional Tool Arguments
--arguments_file                        []             read one or more arguments files and add them to the command line
--bloom-false-positive-probability, -P  0.0            If non-zero, creates a Bloom filter with this false positive probability
--gcs-max-retries, -gcs-retries         20             If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
--gcs-project-for-requester-pays        ""             Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed.
--help, -h                              false          display the help message
--kmer-mask, -M                         ""             Comma-delimited list of base indices (starting with 0) to mask in each k-mer
--kmer-size, -SZ                        31             K-mer size, must be odd and less than 32
--kmer-spacing, -SP                     1              Spacing between successive k-mers
--version                               false          display the version number for this tool

Optional Common Arguments
--gatk-config-file                      null           A configuration file to use with the GATK.
--QUIET                                 false          Whether to suppress job-summary info on System.err.
--tmp-dir                               null           Temp directory to use.
--use-jdk-deflater, -jdk-deflater       false          Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater, -jdk-inflater       false          Whether to use the JdkInflater (as opposed to IntelInflater)
--verbosity                             INFO           Control verbosity of logging.

Advanced Arguments
--showHidden                            false          display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments shared with other tools (e.g. command-line GATK arguments) are also available.


--arguments_file / NA

read one or more arguments files and add them to the command line

List[File]  []


--bloom-false-positive-probability / -P

If non-zero, creates a Bloom filter with this false positive probability

Note that the provided argument is used as an upper limit on the probability, and the actual false positive probability may be less.

double  0.0  [ [ 0  1 ] ]  (recommended maximum: 0.001)


--gatk-config-file / NA

A configuration file to use with the GATK.

String  null


--gcs-max-retries / -gcs-retries

If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection

int  20  [ [ -∞  ∞ ] ]


--gcs-project-for-requester-pays / NA

Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed.

String  ""


--help / -h

display the help message

boolean  false


--kmer-mask / -M

Comma-delimited list of base indices (starting with 0) to mask in each k-mer

K-mer masking allows mismatches to occur at one or more specified positions. Masking the middle base is recommended to enhance host read detection.

String  ""
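
The effect of masking can be illustrated as follows. In this sketch a masked position is replaced with a placeholder character, so that k-mers differing only at that position map to the same key. The function name, the 'N' placeholder, and the string representation are assumptions for illustration; the tool's internal representation is not described here.

```python
def mask_kmer(kmer: str, kmer_mask: str) -> str:
    """Replace the bases at the comma-delimited 0-based indices with 'N'."""
    indices = {int(i) for i in kmer_mask.split(",") if i.strip()}
    return "".join("N" if i in indices else base for i, base in enumerate(kmer))


# Two toy 5-mers that differ only at index 2 collapse to the same masked key.
host = mask_kmer("ACGTA", "2")
read = mask_kmer("ACTTA", "2")
print(host, read, host == read)
```

An empty mask string, the default, leaves the k-mer unchanged.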


--kmer-size / -SZ

K-mer size, must be odd and less than 32

Reducing the k-mer length will increase the number of host reads subtracted in the filtering phase of the pipeline, but it may also increase the number of non-host (i.e. microbial) reads that are incorrectly subtracted. Note that changing the length of the k-mer does not affect memory usage.

int  31  [ [ 1  31 ] ]
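
One way to see why k-mer length need not change memory usage is to assume that each k-mer is packed at 2 bits per base into a fixed-width 64-bit word. This encoding is an assumption made for illustration (it is not described on this page), as are the function name and the base-to-code mapping. Under that assumption, any k up to 31 needs at most 62 bits, so every k-mer occupies the same space.

```python
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # hypothetical 2-bit encoding


def pack_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODES[base]
    return value


# The longest allowed k-mer (k = 31) still fits in a 64-bit word.
longest = pack_kmer("T" * 31)
print(longest.bit_length())  # 62
```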


--kmer-spacing / -SP

Spacing between successive k-mers

The k-mer set size can be reduced by only storing k-mers starting at every n bases in the reference. By default every k-mer, starting at consecutive bases in the reference, is stored.

int  1  [ [ 1  ∞ ] ]
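
A short sketch of how spacing thins out the stored k-mers. The function name and the toy reference length are assumptions, as is the choice to start counting at the first base of the reference; only the start positions are computed here.

```python
def kmer_start_positions(reference_length: int, k: int = 31, spacing: int = 1) -> range:
    """Start positions of the k-mers kept for a given --kmer-spacing value."""
    return range(0, reference_length - k + 1, spacing)


every_base = kmer_start_positions(100)              # spacing 1 (default)
every_fifth = kmer_start_positions(100, spacing=5)  # spacing 5
print(len(every_base), len(every_fifth))  # 70 14
```

In this toy case a spacing of 5 cuts the number of candidate k-mers five-fold, before any duplicates are collapsed by the set.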


--output / -O

File for k-mer set output. Extension will be automatically added if not present (.hss for hash set or .bfi for Bloom filter)

String  null  (required)


--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean  false


--reference / -R

Reference FASTA file path on local disk

String  null  (required)


--showHidden / -showHidden

display hidden arguments

boolean  false


--tmp-dir / NA

Temp directory to use.

String  null


--use-jdk-deflater / -jdk-deflater

Whether to use the JdkDeflater (as opposed to IntelDeflater)

boolean  false


--use-jdk-inflater / -jdk-inflater

Whether to use the JdkInflater (as opposed to IntelInflater)

boolean  false


--verbosity / -verbosity

Control verbosity of logging.

The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version / NA

display the version number for this tool

boolean  false



GATK version 4.1.0.0 built at Wed, 30 Jan 2019 10:21:04 +0530.