Showing tool doc from version 4.1.3.0 | The latest version is 4.1.3.0

CollectSamErrorMetrics (Picard)

Program to collect error metrics on bases stratified in various ways.

Sequencing errors come in different 'flavors'. For example, some occur during sequencing while others happen during library construction, prior to sequencing. They may be correlated with various aspects of the sequencing experiment: position in the read, base context, insert length, and so on.

This program provides two different kinds of error metrics (one which attempts to distinguish between pre- and post-sequencer errors, and one which doesn't) and a collection of 'stratifiers', each of which assigns bases into various bins. The stratifiers can be used together to generate a composite stratification.

For example:

The BASE_QUALITY stratifier will place bases in bins according to their declared base quality. The READ_ORDINALITY stratifier will place bases in one of two bins depending on whether their read is 'first' or 'second'. One could generate a composite stratifier BASE_QUALITY:READ_ORDINALITY, which will do both stratifications at the same time.
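
The idea of a composite stratifier can be sketched in a few lines of Python. This is only an illustration of the binning concept, not the tool's implementation: the record fields (`qual`, `first_of_pair`, `is_error`) and the bin labels `FIRST`/`SECOND` are assumptions made for the example.

```python
from collections import Counter

def base_quality(base):
    """Bin a base by its declared base quality."""
    return base["qual"]

def read_ordinality(base):
    """Bin a base by whether its read is first or second (labels are illustrative)."""
    return "FIRST" if base["first_of_pair"] else "SECOND"

def composite(*stratifiers):
    """Combine stratifiers: the composite bin is the tuple of the individual bins."""
    return lambda base: tuple(s(base) for s in stratifiers)

# Hypothetical observed bases; the field names are not Picard's.
bases = [
    {"qual": 30, "first_of_pair": True, "is_error": False},
    {"qual": 30, "first_of_pair": True, "is_error": True},
    {"qual": 30, "first_of_pair": False, "is_error": False},
    {"qual": 20, "first_of_pair": False, "is_error": True},
]

stratify = composite(base_quality, read_ordinality)
totals, errors = Counter(), Counter()
for b in bases:
    key = stratify(b)
    totals[key] += 1
    errors[key] += int(b["is_error"])

# One line per composite bin: bin, error count, total bases.
for key in sorted(totals):
    print(key, errors[key], totals[key])
```

Each composite bin is keyed by the pair (base quality, read ordinality), so error counts can be compared across both dimensions at once.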

The resulting metric file will be named according to a provided prefix and a suffix which is generated automatically according to the error metric. The tool can collect multiple metrics in a single pass and there should be hardly any performance loss when specifying multiple metrics at the same time; the default includes a large collection of metrics.

To estimate the error rate the tool assumes that all differences from the reference are errors. For this to be a reasonable assumption the tool needs to know the sites at which the sample is actually polymorphic, and the intervals within which the user is relatively certain that the polymorphic sites are known and accurate. These two inputs are provided through the VCF and INTERVALS arguments. The program will only process sites that are in the intersection of the interval lists in the INTERVALS argument, as long as they are not polymorphic in the VCF.
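
The site-selection rule described above (intersection of all interval lists, minus known polymorphic sites) can be sketched with plain sets. This is a conceptual illustration only: the tuple layout, the 1-based inclusive coordinates, and the function names are assumptions, and the sketch expects at least one interval list.

```python
def sites(intervals):
    """Expand (contig, start, end) intervals, 1-based inclusive, into a set of sites."""
    return {(contig, pos)
            for contig, start, end in intervals
            for pos in range(start, end + 1)}

def eligible_sites(interval_lists, polymorphic_sites):
    """Sites present in every interval list and absent from the known-variation set."""
    if not interval_lists:
        raise ValueError("this sketch expects at least one interval list")
    common = set.intersection(*(sites(il) for il in interval_lists))
    return common - set(polymorphic_sites)

# Hypothetical inputs standing in for two INTERVALS files and one VCF.
list_a = [("chr1", 1, 10)]
list_b = [("chr1", 5, 15)]
known_variants = {("chr1", 7)}

selected = eligible_sites([list_a, list_b], known_variants)
print(sorted(selected))
```

Expanding intervals site by site is only practical for small examples; it is used here because it makes the set logic explicit.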

Category: Diagnostics and Quality Control


CollectSamErrorMetrics (Picard) specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s) | Default value | Summary

Required Arguments
--INPUT, -I | null | Input SAM or BAM file.
--OUTPUT, -O | null | Base name for output files. Actual file names will be generated from the base name and suffixes from the ERROR and STRATIFIER by adding a '.' and then error_by_stratifier[_and_stratifier]*, where 'error' is ERROR's extension and 'stratifier' is STRATIFIER's suffix. For example, an ERROR_METRICS value of ERROR:BASE_QUALITY:GC_CONTENT will produce an extension '.error_by_base_quality_and_gc'. The suffixes can be found in the documentation for ERROR_VALUE and STRATIFIER_VALUE.
--REFERENCE_SEQUENCE, -R | null | Reference sequence file.
--VCF, -V | null | VCF of known variation for sample. The program will skip over polymorphic sites in this VCF and avoid collecting data on these loci.

Optional Tool Arguments
--arguments_file | [] | read one or more arguments files and add them to the command line
--ERROR_METRICS | [ERROR, ERROR:BASE_QUALITY, ERROR:INSERT_LENGTH, ERROR:GC_CONTENT, ERROR:READ_DIRECTION, ERROR:PAIR_ORIENTATION, ERROR:HOMOPOLYMER, ERROR:BINNED_HOMOPOLYMER, ERROR:CYCLE, ERROR:READ_ORDINALITY, ERROR:READ_ORDINALITY:CYCLE, ERROR:READ_ORDINALITY:HOMOPOLYMER, ERROR:READ_ORDINALITY:GC_CONTENT, ERROR:READ_ORDINALITY:PRE_DINUC, ERROR:MAPPING_QUALITY, ERROR:READ_GROUP, ERROR:MISMATCHES_IN_READ, ERROR:ONE_BASE_PADDED_CONTEXT, OVERLAPPING_ERROR, OVERLAPPING_ERROR:BASE_QUALITY, OVERLAPPING_ERROR:INSERT_LENGTH, OVERLAPPING_ERROR:READ_ORDINALITY, OVERLAPPING_ERROR:READ_ORDINALITY:CYCLE, OVERLAPPING_ERROR:READ_ORDINALITY:HOMOPOLYMER, OVERLAPPING_ERROR:READ_ORDINALITY:GC_CONTENT] | Errors to collect in the form of "ERROR(:STRATIFIER)*". To see the values available for ERROR and STRATIFIER look at the documentation for the arguments ERROR_VALUE and STRATIFIER_VALUE.
--ERROR_VALUE | null | A fake argument used to show the options of ERROR (in ERROR_METRICS).
--help, -h | false | display the help message
--INTERVALS, -L | [] | Region(s) to limit analysis to. Supported formats are VCF or interval_list. Will intersect inputs if multiple are given.
--LONG_HOMOPOLYMER, -LH | 6 | Shortest homopolymer which is considered long. Used by the BINNED_HOMOPOLYMER stratifier.
--MAX_LOCI, -MAX | 0 | Maximum number of loci to process (or unlimited if 0).
--MIN_BASE_Q, -BQ | 20 | Minimum base quality to include base.
--MIN_MAPPING_Q, -MQ | 20 | Minimum mapping quality to include read.
--PRIOR_Q, -PE | 30 | The prior error, in phred-scale (used for calculating empirical error rates).
--PROBABILITY, -P | 1.0 | The probability of selecting a locus for analysis (for downsampling).
--STRATIFIER_VALUE | null | A fake argument used to show the options of STRATIFIER (in ERROR_METRICS).
--version | false | display the version number for this tool

Optional Common Arguments
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--TMP_DIR | [] | One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER, -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER, -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.

Advanced Arguments
--showHidden | false | display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
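
As a rough orientation before the per-argument details, the snippet below assembles a command line that supplies only the four required arguments. It is a sketch under assumptions: it presumes the tool is launched through the `gatk` wrapper script, and all file names are hypothetical placeholders. Only the flag names are taken from this page.

```python
import shlex

# Hypothetical file names; only the flag names come from the argument table.
command = [
    "gatk", "CollectSamErrorMetrics",
    "--INPUT", "sample.bam",
    "--OUTPUT", "sample_metrics",
    "--REFERENCE_SEQUENCE", "reference.fasta",
    "--VCF", "sample_known_sites.vcf",
]

# Print a shell-safe rendering of the command rather than executing it.
print(" ".join(shlex.quote(arg) for arg in command))
```

The list form can be passed directly to `subprocess.run(command, check=True)` on a machine where GATK is installed.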


--arguments_file / NA

read one or more arguments files and add them to the command line

List[File]  []


--COMPRESSION_LEVEL / NA

Compression level for all compressed files created (e.g. BAM and VCF).

int  5  [ [ -∞  ∞ ] ]


--CREATE_INDEX / NA

Whether to create a BAM index when writing a coordinate-sorted BAM file.

Boolean  false


--CREATE_MD5_FILE / NA

Whether to create an MD5 digest for any BAM or FASTQ files created.

boolean  false


--ERROR_METRICS / NA

Errors to collect in the form of "ERROR(:STRATIFIER)*". To see the values available for ERROR and STRATIFIER look at the documentation for the arguments ERROR_VALUE and STRATIFIER_VALUE.

List[String]  [ERROR, ERROR:BASE_QUALITY, ERROR:INSERT_LENGTH, ERROR:GC_CONTENT, ERROR:READ_DIRECTION, ERROR:PAIR_ORIENTATION, ERROR:HOMOPOLYMER, ERROR:BINNED_HOMOPOLYMER, ERROR:CYCLE, ERROR:READ_ORDINALITY, ERROR:READ_ORDINALITY:CYCLE, ERROR:READ_ORDINALITY:HOMOPOLYMER, ERROR:READ_ORDINALITY:GC_CONTENT, ERROR:READ_ORDINALITY:PRE_DINUC, ERROR:MAPPING_QUALITY, ERROR:READ_GROUP, ERROR:MISMATCHES_IN_READ, ERROR:ONE_BASE_PADDED_CONTEXT, OVERLAPPING_ERROR, OVERLAPPING_ERROR:BASE_QUALITY, OVERLAPPING_ERROR:INSERT_LENGTH, OVERLAPPING_ERROR:READ_ORDINALITY, OVERLAPPING_ERROR:READ_ORDINALITY:CYCLE, OVERLAPPING_ERROR:READ_ORDINALITY:HOMOPOLYMER, OVERLAPPING_ERROR:READ_ORDINALITY:GC_CONTENT]
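
To make the "ERROR(:STRATIFIER)*" form concrete, here is a small sketch that splits such a string into its error type and stratifiers and checks them against the values listed under ERROR_VALUE and STRATIFIER_VALUE. The function name and error messages are illustrative assumptions; this is not how the tool itself parses the argument.

```python
# Allowed values as listed for ERROR_VALUE and STRATIFIER_VALUE on this page.
ERROR_TYPES = {"ERROR", "OVERLAPPING_ERROR"}
STRATIFIERS = {
    "ALL", "GC_CONTENT", "READ_ORDINALITY", "READ_BASE", "READ_DIRECTION",
    "PAIR_ORIENTATION", "PAIR_PROPERNESS", "REFERENCE_BASE", "PRE_DINUC",
    "POST_DINUC", "HOMOPOLYMER_LENGTH", "HOMOPOLYMER", "BINNED_HOMOPOLYMER",
    "FLOWCELL_TILE", "READ_GROUP", "CYCLE", "BINNED_CYCLE", "SOFT_CLIPS",
    "INSERT_LENGTH", "BASE_QUALITY", "MAPPING_QUALITY", "MISMATCHES_IN_READ",
    "ONE_BASE_PADDED_CONTEXT", "TWO_BASE_PADDED_CONTEXT", "CONSENSUS",
    "NS_IN_READ",
}

def parse_error_metric(spec):
    """Split 'ERROR(:STRATIFIER)*' into (error_type, [stratifiers]) and validate both."""
    error_type, *stratifiers = spec.split(":")
    if error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error type: {error_type!r}")
    for name in stratifiers:
        if name not in STRATIFIERS:
            raise ValueError(f"unknown stratifier: {name!r}")
    return error_type, stratifiers

print(parse_error_metric("ERROR:READ_ORDINALITY:CYCLE"))
print(parse_error_metric("OVERLAPPING_ERROR"))
```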


--ERROR_VALUE / NA

A fake argument used to show the options of ERROR (in ERROR_METRICS).

The --ERROR_VALUE argument is an enumerated type (ErrorType), which can have one of the following values:

ERROR
OVERLAPPING_ERROR

ErrorType  null


--GA4GH_CLIENT_SECRETS / NA

Google Genomics API client_secrets.json file path.

String  client_secrets.json


--help / -h

display the help message

boolean  false


--INPUT / -I

Input SAM or BAM file.

R File  null


--INTERVALS / -L

Region(s) to limit analysis to. Supported formats are VCF or interval_list. Will intersect inputs if multiple are given.

List[File]  []


--LONG_HOMOPOLYMER / -LH

Shortest homopolymer which is considered long. Used by the BINNED_HOMOPOLYMER stratifier.

int  6  [ [ -∞  ∞ ] ]
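
Because LONG_HOMOPOLYMER is the shortest length that counts as long, the boundary is inclusive. A minimal sketch of that rule follows; the bin labels `SHORT` and `LONG` and the function name are assumptions for illustration and may not match the labels in the tool's output.

```python
def homopolymer_bin(length, long_homopolymer=6):
    """Classify a homopolymer run; runs of at least `long_homopolymer` bases are long."""
    return "LONG" if length >= long_homopolymer else "SHORT"

# With the default threshold of 6, a run of 5 is short and a run of 6 is long.
for run_length in (5, 6, 7):
    print(run_length, homopolymer_bin(run_length))
```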


--MAX_LOCI / -MAX

Maximum number of loci to process (or unlimited if 0).

long  0  [ [ -∞  ∞ ] ]


--MAX_RECORDS_IN_RAM / NA

When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.

Integer  500000  [ [ -∞  ∞ ] ]


--MIN_BASE_Q / -BQ

Minimum base quality to include base.

int  20  [ [ -∞  ∞ ] ]


--MIN_MAPPING_Q / -MQ

Minimum mapping quality to include read.

int  20  [ [ -∞  ∞ ] ]
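
The combined effect of MIN_BASE_Q and MIN_MAPPING_Q can be pictured as a simple predicate. This sketch assumes both thresholds are inclusive (a value equal to the minimum is kept), which is the natural reading of "minimum" but is not stated explicitly on this page; the function name is hypothetical.

```python
def passes_filters(base_quality, mapping_quality, min_base_q=20, min_mapping_q=20):
    """Return True if a base should be counted.

    Assumption: both thresholds are inclusive, i.e. a quality equal to the
    minimum is accepted.
    """
    return base_quality >= min_base_q and mapping_quality >= min_mapping_q

print(passes_filters(30, 60))  # both above the defaults
print(passes_filters(10, 60))  # base quality too low
print(passes_filters(30, 5))   # mapping quality too low
```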


--OUTPUT / -O

Base name for output files. Actual file names will be generated from the base name and suffixes from the ERROR and STRATIFIER by adding a '.' and then error_by_stratifier[_and_stratifier]*, where 'error' is ERROR's extension and 'stratifier' is STRATIFIER's suffix. For example, an ERROR_METRICS value of ERROR:BASE_QUALITY:GC_CONTENT will produce an extension '.error_by_base_quality_and_gc'. The suffixes can be found in the documentation for ERROR_VALUE and STRATIFIER_VALUE.

R File  null
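
The naming rule can be sketched as follows. The suffix tables contain only the three entries that can be read off the example in the description above (`error`, `base_quality`, `gc`); suffixes for other values are not given on this page and would need to be looked up. The function name is hypothetical, and the sketch deliberately handles only metrics with at least one stratifier, since that is the only case the description spells out.

```python
# Only the suffixes that appear in the documented example; others are not listed here.
ERROR_EXTENSION = {"ERROR": "error"}
STRATIFIER_SUFFIX = {"BASE_QUALITY": "base_quality", "GC_CONTENT": "gc"}

def output_file_name(basename, error_metric):
    """Build '<basename>.<error>_by_<stratifier>[_and_<stratifier>]*'."""
    error_type, *stratifiers = error_metric.split(":")
    if not stratifiers:
        raise ValueError("this sketch covers only metrics with at least one stratifier")
    suffixes = "_and_".join(STRATIFIER_SUFFIX[s] for s in stratifiers)
    return f"{basename}.{ERROR_EXTENSION[error_type]}_by_{suffixes}"

print(output_file_name("sample", "ERROR:BASE_QUALITY:GC_CONTENT"))
```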


--PRIOR_Q / -PE

The prior error, in phred-scale (used for calculating empirical error rates).

int  30  [ [ -∞  ∞ ] ]
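
For reference, a phred-scaled value Q corresponds to an error probability of 10^(-Q/10), so the default PRIOR_Q of 30 means a prior error probability of about 1 in 1000. The snippet below shows only this standard conversion; how the tool combines the prior with observed counts is not described on this page and is not modelled here.

```python
import math

def phred_to_probability(q):
    """Convert a phred-scaled quality to an error probability: p = 10^(-q/10)."""
    return 10 ** (-q / 10)

def probability_to_phred(p):
    """Inverse conversion: q = -10 * log10(p)."""
    return -10 * math.log10(p)

print(phred_to_probability(30))
print(probability_to_phred(0.001))
```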


--PROBABILITY / -P

The probability of selecting a locus for analysis (for downsampling).

double  1.0  [ [ -∞  ∞ ] ]
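
Downsampling by a per-locus probability can be pictured as an independent random draw for each locus. The sketch below illustrates that idea with a seeded generator so the result is repeatable; the function name and the seeding are assumptions, and the tool's actual random number handling may differ.

```python
import random

def downsample_loci(loci, probability, seed=0):
    """Keep each locus independently with the given probability."""
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so probability 1.0 keeps everything
    # and probability 0.0 keeps nothing.
    return [locus for locus in loci if rng.random() < probability]

loci = list(range(1, 1001))
kept = downsample_loci(loci, 0.1)
print(len(kept), "of", len(loci), "loci kept")
```

With the default PROBABILITY of 1.0 every locus is analysed, so the argument only has an effect when set below 1.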


--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean  false


--REFERENCE_SEQUENCE / -R

Reference sequence file.

R File  null


--showHidden / -showHidden

display hidden arguments

boolean  false


--STRATIFIER_VALUE / NA

A fake argument used to show the options of STRATIFIER (in ERROR_METRICS).

The --STRATIFIER_VALUE argument is an enumerated type (Stratifier), which can have one of the following values:

ALL
GC_CONTENT
READ_ORDINALITY
READ_BASE
READ_DIRECTION
PAIR_ORIENTATION
PAIR_PROPERNESS
REFERENCE_BASE
PRE_DINUC
POST_DINUC
HOMOPOLYMER_LENGTH
HOMOPOLYMER
BINNED_HOMOPOLYMER
FLOWCELL_TILE
READ_GROUP
CYCLE
BINNED_CYCLE
SOFT_CLIPS
INSERT_LENGTH
BASE_QUALITY
MAPPING_QUALITY
MISMATCHES_IN_READ
ONE_BASE_PADDED_CONTEXT
TWO_BASE_PADDED_CONTEXT
CONSENSUS
NS_IN_READ

Stratifier  null


--TMP_DIR / NA

One or more directories with space available to be used by this program for temporary storage of working files

List[File]  []


--USE_JDK_DEFLATER / -use_jdk_deflater

Use the JDK Deflater instead of the Intel Deflater for writing compressed output

Boolean  false


--USE_JDK_INFLATER / -use_jdk_inflater

Use the JDK Inflater instead of the Intel Inflater for reading compressed input

Boolean  false


--VALIDATION_STRINGENCY / NA

Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency  STRICT


--VCF / -V

VCF of known variation for sample. The program will skip over polymorphic sites in this VCF and avoid collecting data on these loci.

R File  null


--VERBOSITY / NA

Control verbosity of logging.

The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version / NA

display the version number for this tool

boolean  false



GATK version 4.1.3.0 built at Fri, 9 Aug 2019 21:16:03 -0400.