Showing tool doc from version 4.0.4.0 | The latest version is 4.1.4.1

CollectSequencingArtifactMetrics (Picard)

Collect metrics to quantify single-base sequencing artifacts.

This tool examines two sources of sequencing errors associated with hybrid selection protocols: pre-adapter and bait-bias. Pre-adapter errors can arise from laboratory manipulations of a nucleic acid sample (e.g. shearing) and occur prior to the ligation of adapters for PCR amplification (hence the name pre-adapter).

Bait-bias artifacts occur during or after the target selection step, and correlate with substitution rates that are 'biased', or higher for sites having one base on the reference/positive strand relative to sites having the complementary base on that strand. For example, during the target selection step, a (G>T) artifact might result in a higher substitution rate at sites with a G on the positive strand (and C on the negative), relative to sites with the flip (C positive)/(G negative). This is known as the 'G-Ref' artifact.
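
To make the comparison concrete, the following sketch computes the substitution rate at sites with each of the two complementary reference bases and takes the difference. The counts are made up for illustration, and the helper names `sub_rate` and `rate_diff` are hypothetical; the metric definitions actually used by the tool are described in BaitBiasDetailMetrics.

```shell
# sub_rate ALT REF: fraction of observations that show the substitution.
sub_rate() {
  awk -v alt="$1" -v ref="$2" 'BEGIN { printf "%.6f\n", alt / (alt + ref) }'
}

# rate_diff A B: difference between two rates.
rate_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.6f\n", a - b }'
}

# Made-up counts, for illustration only.
g_ref=$(sub_rate 30 9970)   # G>T seen at sites with G on the positive strand
c_ref=$(sub_rate 5 9995)    # C>A seen at sites with C on the positive strand

# A clearly positive difference points toward the G-Ref orientation.
echo "G-ref rate: $g_ref, C-ref rate: $c_ref, difference: $(rate_diff "$g_ref" "$c_ref")"
```

With these illustrative counts the G-ref rate is 0.003000 and the C-ref rate is 0.000500, a difference of 0.002500.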

For additional information on these types of artifacts, please see the corresponding GATK dictionary entries on bait-bias and pre-adapter artifacts.

This tool produces four files: summary and detail metrics files for both pre-adapter and bait-bias artifacts. The detail metrics show the error rates for each type of base substitution within every possible triplet base configuration. Error rates associated with these substitutions are Phred-scaled and provided as quality scores: the lower the value, the more likely it is that an alternate base call is due to an artifact. The summary metrics provide likelihood information on the 'worst-case' errors.
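
As a quick illustration of Phred scaling, the sketch below converts an error rate into a quality score using the standard definition Q = -10 * log10(error rate). The function name `phred` is hypothetical, and any capping or rounding the tool itself applies is not modeled here.

```shell
# phred RATE: print -10 * log10(RATE) to two decimals.
# awk's log() is the natural logarithm, so divide by log(10).
phred() {
  awk -v p="$1" 'BEGIN { printf "%.2f\n", -10 * log(p) / log(10) }'
}

phred 0.001    # error rate of 1 in 1,000   -> prints 30.00
phred 0.0001   # error rate of 1 in 10,000  -> prints 40.00
```

A tenfold drop in error rate raises the score by 10, which is why low scores flag substitutions that are more likely to be artifacts.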

Usage example:

java -jar picard.jar CollectSequencingArtifactMetrics \
I=input.bam \
O=artifact_metrics.txt \
R=reference_sequence.fasta

Please see the metrics at the following links PreAdapterDetailMetrics, PreAdapterSummaryMetrics, BaitBiasDetailMetrics, and BaitBiasSummaryMetrics for complete descriptions of the output metrics produced by this tool.
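
The value given to O is used as a prefix for the four metrics files. The sketch below lists the file names to expect, assuming all four follow the pattern shown for --FILE_EXTENSION (OUTPUT.pre_adapter_summary_metrics.EXT). The helper `metric_files` is hypothetical, the three names other than the pre-adapter summary are inferred from that pattern, and the extension is assumed to be appended exactly as given (including its leading dot).

```shell
# metric_files PREFIX [EXT]: list the four metrics files expected for an OUTPUT prefix.
# EXT, if given, is appended as-is (assumption: it carries its own leading dot).
metric_files() {
  for kind in pre_adapter_summary pre_adapter_detail bait_bias_summary bait_bias_detail; do
    printf '%s.%s_metrics%s\n' "$1" "$kind" "${2:-}"
  done
}

metric_files artifact_metrics.txt
```

For the usage example above, the first line printed is artifact_metrics.txt.pre_adapter_summary_metrics.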

Category Diagnostics and Quality Control


Overview

Quantify substitution errors caused by mismatched base pairings during various stages of sample / library prep. We measure two distinct error types - artifacts that are introduced before the addition of the read1/read2 adapters ("pre adapter") and those that are introduced after target selection ("bait bias"). For each of these, we provide summary metrics as well as detail metrics broken down by reference context (the ref bases surrounding the substitution event). For a deeper explanation, see Costello et al. 2013: http://www.ncbi.nlm.nih.gov/pubmed/23303777

CollectSequencingArtifactMetrics (Picard) specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the Argument details list below the table.

Argument name(s)  Default value  Summary

Required Arguments
--INPUT / -I  null  Input SAM or BAM file.
--OUTPUT / -O  null  File to write the output to.
--REFERENCE_SEQUENCE / -R  null  Reference sequence file.

Optional Tool Arguments
--arguments_file  []  read one or more arguments files and add them to the command line
--ASSUME_SORTED / -AS  true  If true (default), then the sort order in the header file will be ignored.
--CONTEXT_SIZE  1  The number of context bases to include on each side of the assayed base.
--CONTEXTS_TO_PRINT  []  If specified, only print results for these contexts in the detail metrics output. However, the summary metrics output will still take all contexts into consideration.
--DB_SNP  null  VCF format dbSNP file, used to exclude regions around known polymorphisms from analysis.
--FILE_EXTENSION / -EXT  null  Append the given file extension to all metric file names (ex. OUTPUT.pre_adapter_summary_metrics.EXT). None if null
--help / -h  false  display the help message
--INCLUDE_DUPLICATES / -DUPES  false  Include duplicate reads. If set to true then all reads flagged as duplicates will be included as well.
--INCLUDE_NON_PF_READS / -NON_PF  false  Whether or not to include non-PF reads.
--INCLUDE_UNPAIRED / -UNPAIRED  false  Include unpaired reads. If set to true then all paired reads will be included as well - MINIMUM_INSERT_SIZE and MAXIMUM_INSERT_SIZE will be ignored.
--INTERVALS  null  An optional list of intervals to restrict analysis to.
--MAXIMUM_INSERT_SIZE / -MAX_INS  600  The maximum insert size for a read to be included in analysis. Set to 0 to have no maximum.
--MINIMUM_INSERT_SIZE / -MIN_INS  60  The minimum insert size for a read to be included in analysis.
--MINIMUM_MAPPING_QUALITY / -MQ  30  The minimum mapping quality score for a base to be included in analysis.
--MINIMUM_QUALITY_SCORE / -Q  20  The minimum base quality score for a base to be included in analysis.
--STOP_AFTER  0  Stop after processing N reads, mainly for debugging.
--TANDEM_READS / -TANDEM  false  Set to true if mate pairs are being sequenced from the same strand, i.e. they're expected to face the same direction.
--USE_OQ  true  When available, use original quality scores for filtering.
--version  false  display the version number for this tool

Optional Common Arguments
--COMPRESSION_LEVEL  5  Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX  false  Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE  false  Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS  client_secrets.json  Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM  500000  When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET  false  Whether to suppress job-summary info on System.err.
--TMP_DIR  []  One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER / -use_jdk_deflater  false  Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER / -use_jdk_inflater  false  Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY  STRICT  Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY  INFO  Control verbosity of logging.

Advanced Arguments
--showHidden  false  display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--arguments_file / NA

read one or more arguments files and add them to the command line

List[File]  []


--ASSUME_SORTED / -AS

If true (default), then the sort order in the header file will be ignored.

boolean  true


--COMPRESSION_LEVEL / NA

Compression level for all compressed files created (e.g. BAM and VCF).

int  5  [ [ -∞  ∞ ] ]


--CONTEXT_SIZE / NA

The number of context bases to include on each side of the assayed base.

int  1  [ [ -∞  ∞ ] ]
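
A context consists of the assayed base plus CONTEXT_SIZE bases on each side, so there are 4^(2 * CONTEXT_SIZE + 1) possible contexts. The sketch below works this out and enumerates the triplets for the default size of 1; the helper names `context_count` and `triplets` are hypothetical.

```shell
# context_count SIZE: number of distinct contexts with SIZE flanking bases per side.
context_count() {
  awk -v n="$1" 'BEGIN { print 4 ^ (2 * n + 1) }'
}

# triplets: enumerate every 3-base context (CONTEXT_SIZE of 1).
triplets() {
  for a in A C G T; do
    for b in A C G T; do
      for c in A C G T; do
        echo "$a$b$c"
      done
    done
  done
}

context_count 1        # prints 64
context_count 2        # prints 1024
triplets | head -n 3   # prints AAA, AAC, AAG on separate lines
```

The number of contexts grows by a factor of 16 for each extra flanking base, so larger values spread the same data over many more rows of the detail metrics.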


--CONTEXTS_TO_PRINT / NA

If specified, only print results for these contexts in the detail metrics output. However, the summary metrics output will still take all contexts into consideration.

Set[String]  []
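
If a run was made without --CONTEXTS_TO_PRINT, a similar restriction can be applied to a detail metrics file afterwards. The sketch below assumes the usual Picard metrics layout (comment lines starting with `#`, then a tab-delimited header row and data rows) and a column named CONTEXT; the function name `filter_contexts` is hypothetical.

```shell
# filter_contexts FILE CONTEXT...: print the header row and the rows whose
# CONTEXT column matches one of the requested contexts.
filter_contexts() {
  file=$1
  shift
  awk -F '\t' -v wanted="$*" '
    BEGIN { n = split(wanted, w, " "); for (i = 1; i <= n; i++) keep[w[i]] = 1 }
    /^#/ || NF == 0 { next }   # skip comment and blank lines
    !col {                     # first remaining line is the header row
      for (i = 1; i <= NF; i++) if ($i == "CONTEXT") col = i
      print
      next
    }
    ($col in keep)             # keep only the requested contexts
  ' "$file"
}

# Example call (file name follows the usage example; adjust to your OUTPUT prefix):
# filter_contexts artifact_metrics.txt.pre_adapter_detail_metrics ACG CCG
```

Unlike --CONTEXTS_TO_PRINT, this only trims an existing file; it does not change how the tool computes its summary metrics.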


--CREATE_INDEX / NA

Whether to create a BAM index when writing a coordinate-sorted BAM file.

Boolean  false


--CREATE_MD5_FILE / NA

Whether to create an MD5 digest for any BAM or FASTQ files created.

boolean  false


--DB_SNP / NA

VCF format dbSNP file, used to exclude regions around known polymorphisms from analysis.

File  null


--FILE_EXTENSION / -EXT

Append the given file extension to all metric file names (ex. OUTPUT.pre_adapter_summary_metrics.EXT). None if null

String  null


--GA4GH_CLIENT_SECRETS / NA

Google Genomics API client_secrets.json file path.

String  client_secrets.json


--help / -h

display the help message

boolean  false


--INCLUDE_DUPLICATES / -DUPES

Include duplicate reads. If set to true then all reads flagged as duplicates will be included as well.

boolean  false


--INCLUDE_NON_PF_READS / -NON_PF

Whether or not to include non-PF reads.

boolean  false


--INCLUDE_UNPAIRED / -UNPAIRED

Include unpaired reads. If set to true then all paired reads will be included as well - MINIMUM_INSERT_SIZE and MAXIMUM_INSERT_SIZE will be ignored.

boolean  false


--INPUT / -I

Input SAM or BAM file.

R File  null


--INTERVALS / NA

An optional list of intervals to restrict analysis to.

File  null


--MAX_RECORDS_IN_RAM / NA

When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.

Integer  500000  [ [ -∞  ∞ ] ]


--MAXIMUM_INSERT_SIZE / -MAX_INS

The maximum insert size for a read to be included in analysis. Set to 0 to have no maximum.

int  600  [ [ -∞  ∞ ] ]


--MINIMUM_INSERT_SIZE / -MIN_INS

The minimum insert size for a read to be included in analysis.

int  60  [ [ -∞  ∞ ] ]
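
The two insert size limits combine into a single range check, with a maximum of 0 meaning no upper bound. The sketch below expresses that rule with the default values; the function name `insert_size_ok` is hypothetical, and comparing the magnitude of the insert size (which can be negative in SAM records) is an assumption.

```shell
# insert_size_ok SIZE MIN MAX: succeed if |SIZE| lies within [MIN, MAX].
# A MAX of 0 means there is no upper bound.
insert_size_ok() {
  size=${1#-}   # strip a leading minus sign to compare the magnitude
  [ "$size" -ge "$2" ] && { [ "$3" -eq 0 ] || [ "$size" -le "$3" ]; }
}

insert_size_ok 350 60 600 && echo "350: kept"
insert_size_ok 45 60 600 || echo "45: filtered"
insert_size_ok 5000 60 0 && echo "5000: kept (no maximum)"
```

As noted for --INCLUDE_UNPAIRED, these limits are ignored when that option is set to true.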


--MINIMUM_MAPPING_QUALITY / -MQ

The minimum mapping quality score for a base to be included in analysis.

int  30  [ [ -∞  ∞ ] ]


--MINIMUM_QUALITY_SCORE / -Q

The minimum base quality score for a base to be included in analysis.

int  20  [ [ -∞  ∞ ] ]


--OUTPUT / -O

File to write the output to.

R File  null


--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean  false


--REFERENCE_SEQUENCE / -R

Reference sequence file.

R File  null


--showHidden / -showHidden

display hidden arguments

boolean  false


--STOP_AFTER / NA

Stop after processing N reads, mainly for debugging.

long  0  [ [ -∞  ∞ ] ]


--TANDEM_READS / -TANDEM

Set to true if mate pairs are being sequenced from the same strand, i.e. they're expected to face the same direction.

boolean  false


--TMP_DIR / NA

One or more directories with space available to be used by this program for temporary storage of working files

List[File]  []


--USE_JDK_DEFLATER / -use_jdk_deflater

Use the JDK Deflater instead of the Intel Deflater for writing compressed output

Boolean  false


--USE_JDK_INFLATER / -use_jdk_inflater

Use the JDK Inflater instead of the Intel Inflater for reading compressed input

Boolean  false


--USE_OQ / NA

When available, use original quality scores for filtering.

boolean  true


--VALIDATION_STRINGENCY / NA

Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency  STRICT


--VERBOSITY / NA

Control verbosity of logging.

The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version / NA

display the version number for this tool

boolean  false



GATK version 4.0.4.0 built at 23-40-2018 11:40:56.