The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.
- Whole genome analysis: no need to include intervals
- Whole exome analysis: you need to provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you need to provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet
Whatever you end up using -L for, keep this in mind: for tools that output a bam or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a bam file since doing so will omit some data from the output.
-L 20 (for chromosome 20 in b37/b39 build)
-L chr20:1-100 (for chromosome 20 positions 1-100 in hg18/hg19 build)
HLA-A*01:01:01:01 is a new contig in GRCh38. The colons are a new feature of contig naming for GRCh38 from prior assemblies. This has implications for using the
-L option of GATK as the option also uses the colon as a delimiter to distinguish between contig and genomic coordinates.
When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use
-L chr1:1-100. This also works for our HLA contig, e.g.
However, when passing in an entire contig, for contigs with colons in the name, you must add
:1+ to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.
It is not necessary to use -L in whole genome analysis. You should be interested in the whole genome!
Nevertheless, in some cases, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere). You can do this with -XL, which does the exact opposite of -L; it excludes the provided intervals.
By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.
Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.
Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.
|Tool||-L?||Why / why not|
|RealignerTargetCreator||YES||Faster since RTC will only look for regions that need to be realigned within the input interval; no time wasted on the rest.|
|IndelRealigner||NO||IR will only try to realign the regions output from RealignerTargetCreator, so there is nothing to be gained by providing the capture targets.|
|BaseRecalibrator||YES||This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration.|
|PrintReads||NO||Output is a bam file; using -L would lead to lost data.|
|UnifiedGenotyper/Haplotype Caller||YES||We’re only interested in making calls in exome regions; the rest is a waste of time & includes lots of false positives.|
|Next steps||NO||No need since subsequent steps operate on the callset, which was restricted to the exome at the calling step.|
The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.
You can go crazy with -L while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain the data from that interval.This is really useful when you’re trying to figure out what’s going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what would be the effect of changing a parameter (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you’d use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).