# BaseRecalibrator

Detect systematic errors in base quality scores

## Overview

Variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately, the scores produced by the machines are subject to various sources of systematic technical error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This allows us to get more accurate base qualities, which in turn improves the accuracy of our variant calls.

The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants (which you can bootstrap if there is none available for your organism), then it adjusts the base quality scores in the data based on the model. There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.

This tool performs the first step described above: it builds the model of covariation and produces the recalibration table. It operates only at sites that are not in dbSNP; we assume that all reference mismatches we see are therefore errors and indicative of poor base quality. This tool generates tables based on various user-specified covariates (such as read group, reported quality score, cycle, and context). Assuming we are working with a large amount of data, we can then calculate an empirical probability of error given the particular covariates seen at this site, where p(error) = num mismatches / num observations. The output file is a table (of the several covariate values, number of observations, number of mismatches, empirical quality score).
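The empirical error probability described above can be converted to a Phred-scaled quality with the standard relation Q = -10 * log10(p). The following is a minimal Python sketch of that arithmetic, not the tool's implementation: the function name `empirical_quality` and the cap `max_q` are illustrative assumptions, and any smoothing or priors the tool may apply are ignored.

```python
import math

def empirical_quality(num_mismatches, num_observations, max_q=93.0):
    """Phred-scale the observed mismatch rate for one covariate bin.

    Illustrative only: max_q is an assumed cap used when a bin has
    zero mismatches (where the raw formula would be infinite).
    """
    if num_observations <= 0:
        raise ValueError("need at least one observation")
    p_error = num_mismatches / num_observations
    if p_error == 0.0:
        return max_q
    return min(max_q, -10.0 * math.log10(p_error))

# A bin with 10 mismatches in 10,000 observations has p(error) = 0.001.
print(round(empirical_quality(10, 10_000), 2))  # 30.0
```

With small observation counts the estimate is noisy, which is why the text above assumes a large amount of data.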

### Inputs

A BAM file containing data that needs to be recalibrated.

A database of known polymorphic sites to mask out.

### Output

A GATKReport file with many tables:

• The list of arguments
• The quantized qualities table
• The recalibration table by read group
• The recalibration table by quality score
• The recalibration table for all the optional covariates

The GATKReport table format is intended to be easy to read by both humans and computer languages (especially R). Check out the documentation of the GATKReport (in the FAQs) to learn how to manipulate this table.

### Usage example

    java -jar GenomeAnalysisTK.jar \
      -T BaseRecalibrator \
      -R reference.fasta \
      -I my_reads.bam \
      -knownSites latest_dbsnp.vcf \
      -o recal_data.table


### Notes

• This *base* recalibration process should not be confused with *variant* recalibration, which is a sophisticated filtering technique applied to the variant callset produced in a later step of the analysis workflow.
• ReadGroupCovariate and QualityScoreCovariate are required covariates and will be added regardless of whether or not they were specified.

These Read Filters are automatically applied to the data by the Engine before processing by BaseRecalibrator.

### Parallelism options

This tool can be run in multi-threaded mode using the -nct option.

### Downsampling settings

This tool does not apply any downsampling by default.

## Command-line Arguments

### Engine arguments

All tools inherit arguments from the GATK Engine's "CommandLineGATK" argument collection, which can be used to modify various aspects of the tool's function. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals, and the -rf argument allows you to apply certain read filters to exclude some of the data from the analysis.

### BaseRecalibrator specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.

| Argument name(s) | Default value | Summary |
| --- | --- | --- |
| **Required Outputs** | | |
| --out, -o | NA | The output recalibration table file to create |
| **Optional Inputs** | | |
| --knownSites | [] | A database of known polymorphic sites |
| **Optional Parameters** | | |
| --covariate, -cov | NA | One or more covariates to be used in the recalibration. Can be specified multiple times |
| --indels_context_size, -ics | 3 | Size of the k-mer context to be used for base insertions and deletions |
| --maximum_cycle_value, -maxCycle | 500 | The maximum cycle value permitted for the Cycle covariate |
| --mismatches_context_size, -mcs | 2 | Size of the k-mer context to be used for base mismatches |
| --solid_nocall_strategy | THROW_EXCEPTION | Defines the behavior of the recalibrator when it encounters no calls in the color space. Options = THROW_EXCEPTION, LEAVE_READ_UNRECALIBRATED, or PURGE_READ |
| --solid_recal_mode, -sMode | SET_Q_ZERO | How should we recalibrate solid bases in which the reference was inserted? Options = DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, or REMOVE_REF_BIAS |
| **Optional Flags** | | |
| --list, -ls | false | List the available covariates and exit |
| --lowMemoryMode | false | Reduce memory usage in multi-threaded code at the expense of threading efficiency |
| --no_standard_covs, -noStandard | false | Do not use the standard set of covariates, but rather just the ones listed using the -cov argument |
| --sort_by_all_columns, -sortAllCols | false | Sort the rows in the tables of reports |
| --binary_tag_name, -bintag | NA | the binary tag covariate name if using it |
| --bqsrBAQGapOpenPenalty, -bqsrBAQGOP | 40.0 | BQSR BAQ gap open penalty (Phred Scaled). Default value is 40. 30 is perhaps better for whole genome call sets |
| --deletions_default_quality, -ddq | 45 | default quality for the base deletions covariate |
| --insertions_default_quality, -idq | 45 | default quality for the base insertions covariate |
| --low_quality_tail, -lqt | 2 | minimum quality for the bases in the tail of the reads to be considered |
| --mismatches_default_quality, -mdq | -1 | default quality for the base mismatches covariate |
| --quantizing_levels, -ql | 16 | number of distinct quality scores in the quantized output |
| --run_without_dbsnp_potentially_ruining_quality | false | If specified, allows the recalibrator to be used without a dbsnp rod. Very unsafe and for expert users only. |

### Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Engine arguments above.

### --binary_tag_name / -bintag

the binary tag covariate name if using it
The tag name for the binary tag covariate (if using it)

String  NA

### --bqsrBAQGapOpenPenalty / -bqsrBAQGOP

BQSR BAQ gap open penalty (Phred Scaled). Default value is 40. 30 is perhaps better for whole genome call sets

double  40.0  [ [ -∞  ∞ ] ]

### --covariate / -cov

One or more covariates to be used in the recalibration. Can be specified multiple times
Note that the ReadGroup and QualityScore covariates are required and do not need to be specified. Also, unless --no_standard_covs is specified, the Cycle and Context covariates are standard and are included by default. Use the --list argument to see the available covariates.

String[]  NA

### --deletions_default_quality / -ddq

default quality for the base deletions covariate
A default base quality to use as a prior (reported quality) in the deletion covariate model. This value will replace all base qualities in the read with this default value. A negative value turns it off. [default is on]

byte  45  [ [ -∞  ∞ ] ]

### --indels_context_size / -ics

Size of the k-mer context to be used for base insertions and deletions
The context covariate will use a context of this size to calculate its covariate value for base insertions and deletions. Must be between 1 and 13 (inclusive). Note that higher values will increase runtime and the required Java heap size.

int  3  [ [ -∞  ∞ ] ]

### --insertions_default_quality / -idq

default quality for the base insertions covariate
A default base quality to use as a prior (reported quality) in the insertion covariate model. This parameter is used for all reads without insertion quality scores for each base. [default is on]

byte  45  [ [ -∞  ∞ ] ]

### --knownSites / -knownSites

A database of known polymorphic sites
This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference, so it is critical that a database of known polymorphic sites (e.g. dbSNP) is given to the tool in order to mask out those sites.

This argument supports reference-ordered data (ROD) files in the following formats: BCF2, BEAGLE, BED, BEDTABLE, EXAMPLEBINARY, RAWHAPMAP, REFSEQ, SAMPILEUP, SAMREAD, TABLE, VCF, VCF3

List[RodBinding[Feature]]  []
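The masking idea can be illustrated with a minimal Python sketch. It is not the tool's code: the function `count_mismatches`, its arguments, and the gap-free alignment it assumes are all hypothetical simplifications chosen for illustration.

```python
def count_mismatches(read_bases, ref_bases, start, known_sites):
    """Return (observations, mismatches) for one aligned, gap-free read.

    start is the reference position of the first base; known_sites is a
    set of reference positions to skip. All names here are hypothetical.
    """
    observations = 0
    mismatches = 0
    for offset, (read_base, ref_base) in enumerate(zip(read_bases, ref_bases)):
        if start + offset in known_sites:
            continue  # real variation is expected here, so do not count it
        observations += 1
        if read_base != ref_base:
            mismatches += 1
    return observations, mismatches

# Position 102 is a known polymorphic site, so its mismatch is ignored;
# the mismatch at position 104 is counted as an error.
print(count_mismatches("ACGTA", "ACTTC", 100, {102}))  # (4, 1)
```

Without the known-sites set, the same read would contribute two mismatches, and the true variant at position 102 would be miscounted as a sequencing error.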

### --list / -ls

List the available covariates and exit
Note that the --list argument requires a fully resolved and correct command-line to work.

boolean  false

### --low_quality_tail / -lqt

minimum quality for the bases in the tail of the reads to be considered
Reads with low-quality bases on either tail (beginning or end) will not be considered in the context. This parameter defines the quality at or below which a tail is considered low quality.

byte  2  [ [ -∞  ∞ ] ]
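One way to picture the tail rule is the following Python sketch. It is written only from the description above: the function name `usable_range` is hypothetical, and the tool's actual handling may differ in detail.

```python
def usable_range(quals, low_quality_tail=2):
    """Return (first, last_exclusive) indices left after skipping both tails.

    Bases at either end of the read whose quality is at or below
    low_quality_tail are skipped; low-quality bases in the middle stay.
    """
    first = 0
    last = len(quals)
    while first < last and quals[first] <= low_quality_tail:
        first += 1
    while last > first and quals[last - 1] <= low_quality_tail:
        last -= 1
    return first, last

# Two low-quality bases at the start and one at the end are skipped;
# the quality-2 base at index 3 is not in a tail, so it is kept.
print(usable_range([2, 2, 30, 2, 31, 28, 2]))  # (2, 6)
```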

### --lowMemoryMode / -lowMemoryMode

Reduce memory usage in multi-threaded code at the expense of threading efficiency
When you run with -nct greater than 1, BQSR uses nct times more memory to compute its recalibration tables, for efficiency purposes. If you have many covariates, and therefore are using a lot of memory, you can use this flag to safely access only one table. There may be some CPU cost, but as long as the table is really big the cost should be relatively reasonable.

boolean  false

### --maximum_cycle_value / -maxCycle

The maximum cycle value permitted for the Cycle covariate
The cycle covariate will generate an error if it encounters a cycle greater than this value. This argument is ignored if the Cycle covariate is not used.

int  500  [ [ -∞  ∞ ] ]

### --mismatches_context_size / -mcs

Size of the k-mer context to be used for base mismatches
The context covariate will use a context of this size to calculate its covariate value for base mismatches. Must be between 1 and 13 (inclusive). Note that higher values will increase runtime and the required Java heap size.

int  2  [ [ -∞  ∞ ] ]
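To see why larger context sizes cost more memory, note that a k-mer over four bases has 4^k possible values. The Python sketch below shows both the growth and a simple context extraction. It assumes, for illustration only, that the context of a base is the k-mer ending at that base; the function name `contexts` is hypothetical and this is not the tool's implementation.

```python
def contexts(bases, k):
    """Return the k-mer ending at each base, or None near the read start.

    Assumption for illustration: the context of a base is the k bases
    that end at (and include) that base.
    """
    result = []
    for i in range(len(bases)):
        if i + 1 < k:
            result.append(None)  # not enough preceding bases
        else:
            result.append(bases[i + 1 - k : i + 1])
    return result

print(contexts("ACGT", 2))            # [None, 'AC', 'CG', 'GT']
# Number of distinct contexts for sizes 2, 3 and 13.
print([4 ** k for k in (2, 3, 13)])   # [16, 64, 67108864]
```

The same reasoning applies to --indels_context_size, whose default of 3 gives 64 possible contexts.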

### --mismatches_default_quality / -mdq

default quality for the base mismatches covariate
A default base quality to use as a prior (reported quality) in the mismatch covariate model. This value will replace all base qualities in the read with this default value. A negative value turns it off. [default is off]

byte  -1  [ [ -∞  ∞ ] ]

### --no_standard_covs / -noStandard

Do not use the standard set of covariates, but rather just the ones listed using the -cov argument
The Cycle and Context covariates are standard and are included by default unless this argument is provided. Note that the ReadGroup and QualityScore covariates are required and cannot be excluded.

boolean  false

### --out / -o

The output recalibration table file to create
After the header, data records occur one per line until the end of the file. The first several items on a line are the values of the individual covariates and will change depending on which covariates were specified at runtime. The last three items are the data, that is, the number of observations for this combination of covariates, the number of reference mismatches, and the raw empirical quality score calculated by Phred-scaling the mismatch rate.

R File  NA
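The record layout described above (covariate values first, then three data fields) can be read with a few lines of Python. This is a sketch under stated assumptions: fields are taken to be whitespace-delimited, counts are taken to be integers, the function name `parse_record` is hypothetical, and the sample line is made up rather than real tool output.

```python
def parse_record(line, covariate_names):
    """Split one data line into covariate values and the three data fields.

    Assumes whitespace-delimited fields with exactly three trailing data
    items: observations, mismatches, empirical quality.
    """
    fields = line.split()
    if len(fields) != len(covariate_names) + 3:
        raise ValueError("unexpected number of fields")
    covariates = dict(zip(covariate_names, fields[:-3]))
    observations = int(fields[-3])
    mismatches = int(fields[-2])
    empirical_q = float(fields[-1])
    return covariates, observations, mismatches, empirical_q

# Made-up example line, not real tool output.
line = "rg1 30 12 10000 10 30.0"
print(parse_record(line, ["ReadGroup", "QualityScore", "Cycle"]))
```

Because the number of covariate columns depends on the covariates chosen at runtime, the sketch indexes the data fields from the end of the line.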

### --quantizing_levels / -ql

number of distinct quality scores in the quantized output
BQSR generates a quantization table for quick quantization later by subsequent tools. BQSR does not quantize the base qualities itself; this is done by the engine with the -qq or -BQSR options. This parameter tells BQSR the number of levels of quantization to use to build the quantization table.

int  16  [ [ -∞  ∞ ] ]
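As a rough picture of what a quantization table is, the Python sketch below maps every quality score to one of a fixed number of representative scores using naive equal-width binning. This is deliberately a simpler technique than whatever BQSR does internally and is not the tool's algorithm; the function name `equal_width_quantization` is hypothetical.

```python
def equal_width_quantization(max_q, levels):
    """Map each score 0..max_q to the lowest score in its equal-width bin.

    Naive illustration only; not the method used by BQSR.
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    n_scores = max_q + 1
    table = {}
    for q in range(n_scores):
        bin_index = q * levels // n_scores
        # Smallest score that falls in this bin (ceiling division).
        bin_start = -((-bin_index * n_scores) // levels)
        table[q] = bin_start
    return table

table = equal_width_quantization(max_q=7, levels=4)
print(table)  # {0: 0, 1: 0, 2: 2, 3: 2, 4: 4, 5: 4, 6: 6, 7: 6}
print(len(set(table.values())))  # 4
```

The number of distinct values in the table equals the requested number of levels whenever there are at least that many input scores.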

### --run_without_dbsnp_potentially_ruining_quality / -run_without_dbsnp_potentially_ruining_quality

If specified, allows the recalibrator to be used without a dbsnp rod. Very unsafe and for expert users only.
This calculation is critically dependent on being able to skip over known polymorphic sites. Please be sure that you know what you are doing if you use this option.

boolean  false

### --solid_nocall_strategy / -solid_nocall_strategy

Defines the behavior of the recalibrator when it encounters no calls in the color space. Options = THROW_EXCEPTION, LEAVE_READ_UNRECALIBRATED, or PURGE_READ
BaseRecalibrator accepts a --solid_nocall_strategy flag which governs how the recalibrator handles no calls in the color space tag. Unfortunately, because of the reference-inserted bases (see --solid_recal_mode), reads with no calls in their color space tag cannot be recalibrated.

The --solid_nocall_strategy argument is an enumerated type (SOLID_NOCALL_STRATEGY), which can have one of the following values:

THROW_EXCEPTION
When a no call is detected, throw an exception to alert the user that recalibrating this SOLiD data is unsafe. This is the default option.
LEAVE_READ_UNRECALIBRATED
Leave the read in the output BAM completely untouched. This mode is only okay if the no calls are very rare.
PURGE_READ
Mark these reads as failing vendor quality checks so they can be filtered out by downstream analyses.

SOLID_NOCALL_STRATEGY  THROW_EXCEPTION

### --solid_recal_mode / -sMode

How should we recalibrate solid bases in which the reference was inserted? Options = DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, or REMOVE_REF_BIAS
BaseRecalibrator accepts a --solid_recal_mode flag which governs how the recalibrator handles the reads which have had the reference inserted because of color space inconsistencies.

The --solid_recal_mode argument is an enumerated type (SOLID_RECAL_MODE), which can have one of the following values:

DO_NOTHING
Treat reference inserted bases as reference matching bases. Very unsafe!
SET_Q_ZERO
Set reference inserted bases and the previous base (because of color space alignment details) to Q0. This is the default option.
SET_Q_ZERO_BASE_N
In addition to setting the quality scores to zero, also set the base itself to 'N'. This is useful to visualize in IGV.
REMOVE_REF_BIAS
Look at the color quality scores and probabilistically decide to change the reference inserted base to be the base which is implied by the original color space instead of the reference.

SOLID_RECAL_MODE  SET_Q_ZERO

### --sort_by_all_columns / -sortAllCols

Sort the rows in the tables of reports
Whether GATK report tables should have rows in sorted order, starting from the leftmost column.

Boolean  false