# Dictionary

Definitions of terms used in the docs

Created 2015-10-25 05:12:37 | Updated 2016-03-07 15:23:03 |

Bait bias (single bait bias or reference bias artifact) is a type of artifact that affects data generated through hybrid selection methods.

These artifacts occur during or after the target selection step, and correlate with substitution rates that are biased or higher for sites having one base on the reference/positive strand relative to sites having the complementary base on that strand. For example, a G>T artifact during the target selection step might result in a higher (G>T)/(C>A) substitution rate at sites with a G on the positive strand (and C on the negative), relative to sites with the flip (C positive)/(G negative). This is known as the "G-Ref" artifact.

Created 2015-11-18 16:10:36 | Updated 2015-12-01 23:31:29 |

A biallelic site is a specific locus in a genome that contains two observed alleles, counting the reference as one, and therefore allowing for one variant allele. In practical terms, this is what you would call a site where, across multiple samples in a cohort, you have evidence for a single non-reference allele. Shown below is a toy example in which the consensus sequences for samples 1-3 have a deletion at position 7. Sample 4 matches the reference. This is considered a biallelic site because there are only two possible alleles: a deletion, or the reference allele G.

           1 2 3 4 5 6 7 8 9
Reference: A T A T A T G C G
Sample 1 : A T A T A T - C G
Sample 2 : A T A T A T - C G
Sample 3 : A T A T A T - C G
Sample 4 : A T A T A T G C G

A multiallelic site is a specific locus in a genome that contains three or more observed alleles, again counting the reference as one, and therefore allowing for two or more variant alleles. This is what you would call a site where, across multiple samples in a cohort, you see evidence for two or more non-reference alleles. Shown below is a toy example in which the consensus sequences for samples 1-3 have a deletion or a SNP at the 7th position. Sample 4 matches the reference. This is considered a multiallelic site because there are four possible alleles: a deletion, the reference allele G, a C (SNP), or a T (SNP). True multiallelic sites are not observed very frequently unless you look at very large cohorts, so they are often taken as a sign of a noisy region where artifacts are likely.

           1 2 3 4 5 6 7 8 9
Reference: A T A T A T G C G
Sample 1 : A T A T A T - C G
Sample 2 : A T A T A T C C G
Sample 3 : A T A T A T T C G
Sample 4 : A T A T A T G C G
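The distinction between the two toy examples comes down to counting distinct alleles at a position, reference included. The following is a minimal sketch of that counting rule; the function name `classify_site` and the use of `-` for a deletion are illustrative choices, not part of any GATK tool.

```python
def classify_site(reference_allele, sample_alleles):
    """Classify a site by the number of distinct observed alleles, reference included."""
    observed = {reference_allele} | set(sample_alleles)
    variant_alleles = observed - {reference_allele}
    if not variant_alleles:
        return "monomorphic"   # only the reference allele was seen
    if len(variant_alleles) == 1:
        return "biallelic"     # reference plus one variant allele
    return "multiallelic"      # reference plus two or more variant alleles


# Position 7 in the two toy examples above ("-" stands for the deletion).
print(classify_site("G", ["-", "-", "-", "G"]))  # biallelic
print(classify_site("G", ["-", "C", "T", "G"]))  # multiallelic
```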

Created 2015-10-25 04:33:04 | Updated 2016-02-08 18:21:54 |

Cytosine methylation is a key component in epigenetic regulation of gene expression and frequently occurs at CpG sites throughout the genome. Bisulfite sequencing is a technique used to analyze genome-wide methylation profiles at single-nucleotide resolution [doi:10.1093/nar/gki901]. Sodium bisulfite efficiently and selectively deaminates unmethylated cytosine residues to uracil without affecting 5-methyl cytosine (methylated). When restriction enzymes and PCR are used to enrich for regions of the genome that have high CpG content, the resulting reduced genome comprises ~1% of the original genome but includes key regulatory sequences as well as repeated regions.

The protocol involves several steps. First, genomic DNA is digested with a restriction endonuclease such as MspI, which targets CG dinucleotides. This results in DNA fragments with CG at the ends. Next, the fragments are size selected (via gel electrophoresis), which facilitates the enrichment of CpG-containing sequences. This is followed by bisulfite treatment, which converts unmethylated C nucleotides to uracil (U) while methylated cytosines remain intact. The bisulfite-treated DNA is amplified with a proofreading-deficient DNA polymerase to facilitate amplification of both the methylated cytosines and the C -> U converted bases. After PCR amplification, each original unmethylated cytosine will have been converted to either a T (+ strand) or an A (- strand), while each methylated C will remain a C (+ strand) or a G (- strand). The PCR products are then sequenced using conventional methods and aligned to a reference.
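To make the C-to-T logic concrete, here is a small in-silico sketch of what the + strand looks like after bisulfite treatment and PCR, and how methylation can be read back by comparing a converted read to the reference. The function names and the toy sequence are assumptions made for illustration; real analysis tools also handle the - strand, sequencing errors, and alignment.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Return the + strand sequence expected after bisulfite treatment and PCR.

    Unmethylated C becomes T (via U); methylated C stays C.
    Positions are 0-based indices into `sequence`.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and index not in methylated else base
        for index, base in enumerate(sequence)
    )


def call_methylation(reference, converted_read):
    """Compare a converted + strand read to the reference at every reference C."""
    calls = {}
    for index, (ref_base, read_base) in enumerate(zip(reference, converted_read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[index] = "methylated"
        elif read_base == "T":
            calls[index] = "unmethylated"
    return calls


reference = "ACGTCCGA"                      # toy sequence, not real data
read = bisulfite_convert(reference, {1})    # only the C at index 1 is methylated
print(read)                                 # ACGTTTGA
print(call_methylation(reference, read))
```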

Created 2012-08-11 05:16:06 | Updated 2015-12-19 10:53:18 |

#### Downsampling is a process by which read depth is reduced, either at a particular position or within a region.

Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.

Note that there is also a proportional "downsample to fraction" mechanism that is mostly intended for testing the effect of different overall coverage means on analysis results.

See below for details of how this is implemented and controlled in GATK.

## 1. Downsampling to a target coverage

### Defaults

The GATK's default downsampler (invoked by -dcov) exhibits the following properties:

• The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.
• The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.
• The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.

By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.
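The per-sample cap and the bounded memory use described above can be illustrated with reservoir sampling. Note that this is a swapped-in, simpler technique and not the GATK's own position-aware algorithm: it only shows how each sample can be capped independently while holding at most the target number of reads in memory. The function name and the `(sample, read)` input shape are assumptions.

```python
import random
from collections import defaultdict


def cap_reads_per_sample(reads, max_per_sample=1000, seed=None):
    """Keep at most `max_per_sample` reads per sample using reservoir sampling.

    `reads` is an iterable of (sample_name, read) tuples. Memory use is
    proportional to the cap, not to the full depth of the input.
    """
    rng = random.Random(seed)
    kept = defaultdict(list)   # sample -> reservoir of retained reads
    seen = defaultdict(int)    # sample -> number of reads seen so far
    for sample, read in reads:
        seen[sample] += 1
        reservoir = kept[sample]
        if len(reservoir) < max_per_sample:
            reservoir.append(read)
        else:
            # Replace a random slot with probability max_per_sample / seen[sample].
            slot = rng.randrange(seen[sample])
            if slot < max_per_sample:
                reservoir[slot] = read
    return dict(kept)


# A deep sample does not affect how many reads a shallow sample keeps.
reads = [("deep", i) for i in range(5000)] + [("shallow", i) for i in range(10)]
result = cap_reads_per_sample(reads, max_per_sample=1000, seed=0)
print(len(result["deep"]), len(result["shallow"]))  # 1000 10
```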

### Customizing

From the command line:

• To disable the downsampler, specify -dt NONE.
• To change the default coverage per-sample, specify the desired coverage to the -dcov option.

To modify the walker's default behavior:

• Add the @Downsample interface to the top of your walker. Override the downsampling type by changing the by=<value>. Override the downsampling depth by changing the toCoverage=<value>.

### Algorithm details

The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already established pileup, a single-base locus, and a pile of reads with an alignment start of single-base locus + 1, the outline of the algorithm is as follows:

For each sample:

• Select reads with the next alignment start.
• While the number of existing reads + the number of incoming reads is greater than the target sample size:
    • Walk backward through each set of reads having the same alignment start. If the count of reads having the same alignment start is > 1, throw out one randomly selected read.

• If we have n slots available where n is >= 1, randomly select n of the incoming reads and add them to the pileup.
• Otherwise, we have zero slots available. Choose the read from the existing pileup with the least alignment start. Throw it out and add one randomly selected read from the new pileup.

## 2. Downsampling to a fraction of the coverage

Reads will be downsampled so the specified fraction remains; e.g. if you specify -dfrac 0.25, three-quarters of the reads will be removed, and the remaining one quarter will be used in the analysis. This method of downsampling is truly unbiased and random. It is typically used to simulate the effect of generating different amounts of sequence data for a given sample. For example, you can use this in a pilot experiment to evaluate how much target coverage you need to aim for in order to obtain enough coverage in all loci of interest.
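As a rough illustration of fraction-based downsampling, the sketch below keeps each read independently with probability equal to the requested fraction. This is an assumption about how such a scheme can work, not a description of the GATK's implementation; with this approach the retained fraction is approximately, not exactly, the requested one.

```python
import random


def downsample_to_fraction(reads, fraction, seed=None):
    """Keep each read independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]


# Analogous in spirit to -dfrac 0.25: roughly one quarter of the reads remain.
kept = downsample_to_fraction(range(100000), 0.25, seed=1)
print(len(kept) / 100000)
```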

Created 2016-11-17 18:33:06 | Updated |

### Heterozygosity in population genetics

In the context of population genetics, heterozygosity can refer to the fraction of individuals in a given population that are heterozygous at a given locus, or the fraction of loci that are heterozygous in an individual. See the Wikipedia entries on Heterozygosity and Coalescent Theory as well as the book "Population Genetics: A Concise Guide" by John H. Gillespie for further details on related theory.

### Heterozygosity in GATK

In GATK genotyping, we use an "expected heterozygosity" value to compute the prior probability that a locus is non-reference. Given the expected heterozygosity hets, we calculate the probability of N samples being hom-ref at a site as $$1 - \sum_{i=1}^{2N} \frac{hets}{i}$$. The default value provided for humans is hets = 1e-3; a value of 0.001 implies that two randomly chosen chromosomes from the population of organisms would differ from each other at a rate of 1 in 1000 bp. In this context hets is analogous to the parameter theta from population genetics. The hets parameter value can be modified if desired.

Note that this quantity has nothing to do with the likelihood of any given sample having a heterozygous genotype, which in the GATK is purely determined by the probability of the observed data P(D | AB) under the model that there may be an AB heterozygous genotype. The posterior probability of this AB genotype would use the hets prior, but the GATK only uses this posterior probability in determining the probability that a site is polymorphic. So changing the hets parameters only increases the chance that a site will be called non-reference across all samples, but doesn't actually change the output genotype likelihoods at all, as these aren't posterior probabilities. The one quantity that changes whether the GATK considers the possibility of a heterozygous genotype at all is the ploidy, which describes how many copies of each chromosome each individual in the species carries.
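The hom-ref prior given above is easy to evaluate numerically. The sketch below assumes the sum runs over i = 1 to 2N (that is, diploid samples, so 2N chromosomes); the function name is illustrative and this is not GATK code.

```python
def hom_ref_prior(num_samples, hets=1e-3):
    """Prior probability that all N diploid samples are hom-ref at a site.

    Implements 1 - sum_{i=1}^{2N} (hets / i), assuming the sum starts at i = 1.
    """
    num_chromosomes = 2 * num_samples
    return 1.0 - sum(hets / i for i in range(1, num_chromosomes + 1))


for n in (1, 10, 100):
    print(n, hom_ref_prior(n))
```

For a single sample this gives 1 - (0.001 + 0.0005), or about 0.9985, and the prior decreases slowly as more samples are added, because the sum grows like a harmonic series.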

Created 2015-10-25 04:56:30 | Updated 2016-03-07 15:29:20 |

Hybrid selection is a method that enables selection of specific sequences from a pool of genomic DNA for targeted sequencing analyses via pull-down assays. Typical applications include the selection of exome sequences or pathogen-specific sequences in complex biological samples. Hybrid selection involves the use of baits to select desired fragments.

Briefly, baits are RNA (or sometimes DNA) molecules synthesized with biotinylated nucleotides. The biotinylated nucleotides are ligands for streptavidin, enabling RNA:DNA hybrids to be captured in solution. The hybridization targets are sheared genomic DNA fragments, which have been "polished" with synthetic adapters to facilitate PCR cloning downstream. Hybridization of the baits with the denatured targets is followed by selective capture of the RNA:DNA "hybrids" using streptavidin-coated beads via pull-down assays or columns.

Systematic errors, ultimately leading to sequence bias and incorrect variant calls, can arise at several steps. See the GATK dictionary entries bait bias and pre-adapter artifacts for more details.

Please see the following reference for the theory behind this technique.

Created 2015-10-25 03:24:42 | Updated |

Jumping libraries are created to bypass difficult to align/map regions, such as those containing repetitive DNA sequences. Briefly, the DNA of interest is identified and cut into fragments, either with restriction enzymes or by shearing. The size-selected fragments are ligated to adapters for bead-capture and circularized. After bead-capture, the DNA is linearized via restriction enzymes, and can be sequenced using adapter primers facing in outward [reverse/forward (RF)] directions. These library inserts are considered jumping because the ends originate from distal genomic DNA sequences and are ligated adjacent to one another during circularization. Potential artifacts of this method include small inserts (lacking the linearizing restriction enzyme sequence), which are inward-facing [forward/reverse (FR)] (non-jumping) read pairs. In addition, chimeras result from the paired ends falling on different chromosomes, the insert size exceeding the maximum of 100 KB, or two times the mode of the insert size for outward-facing pairs. For additional information, see the Wikipedia article.

Created 2016-06-22 20:02:04 | Updated 2016-07-01 14:19:37 |

There are several instances in the GATK documentation where you will encounter the terms "likelihood" and "probability", because key tools in the variant discovery workflow rely heavily on Bayesian statistics. For example, the HaplotypeCaller, our most prominent germline SNP and indel caller, uses Bayesian statistics to determine genotypes.

#### So what do likelihood and probability mean and how are they related to each other in the Bayesian context?

In Bayesian statistics (as opposed to frequentist statistics), we are typically trying to evaluate the posterior probability of a hypothesis (H) based on a series of observations (data, D).

Bayes' rule states that

$${P(H|D)}=\frac{P(H)P(D|H)}{P(D)}$$

where the bit we care about most, P(D|H), is the probability of observing D given the hypothesis H. This can also be formulated as L(H|D), i.e. the likelihood of the hypothesis H given the observation D:

$$P(D|H)=L(H|D)$$

We use the term likelihood instead of probability to describe the term on the right because we cannot calculate a meaningful probability distribution on a hypothesis, which by definition is binary (it will either be true or false) -- but we can determine the likelihood that a hypothesis is true or false given a set of observations. For a more detailed explanation of these concepts, please see the following lesson (http://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading11.pdf).

Now you may wonder, what about the posterior probability P(H|D) that we eventually calculate through Bayes' rule? Isn't that a "probability of a hypothesis"? Well yes; in Bayesian statistics, we can calculate a posterior probability distribution on a hypothesis, because its probability distribution is relative to all of the other competing hypotheses (http://www.smbc-comics.com/index.php?id=4127). Tadaa.

See this HaplotypeCaller doc article for a worked out explanation of how we calculate and use genotype likelihoods in germline variant calling.

So always remember this, if nothing else: the terms likelihood and probability are not interchangeable in the Bayesian context, even though they are often used interchangeably in common English.
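Here is a small numeric sketch of Bayes' rule as written above, applied to three competing genotype hypotheses. The prior and likelihood values are made-up numbers chosen only for illustration, and the function name is hypothetical; this is not how any GATK tool is implemented.

```python
def posterior_probabilities(priors, likelihoods):
    """Apply Bayes' rule to a set of competing hypotheses.

    priors[h] is P(H), likelihoods[h] is P(D|H) = L(H|D).
    P(D) is obtained by summing P(H) * P(D|H) over all hypotheses.
    """
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())  # P(D)
    return {h: value / evidence for h, value in unnormalized.items()}


# Hypothetical values: the data strongly favor AB even though its prior is small.
priors = {"AA": 0.9985, "AB": 0.0010, "BB": 0.0005}
likelihoods = {"AA": 1e-6, "AB": 1e-2, "BB": 1e-4}

posteriors = posterior_probabilities(priors, likelihoods)
print(posteriors)
print(max(posteriors, key=posteriors.get))  # AB
```

Note that the likelihoods on their own do not sum to one across hypotheses, whereas the posteriors do. That is one practical way to see why the two terms are not interchangeable.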

A special thanks to Jon M. Bloom PhD (MIT) for his assistance in the preparation of this article.

Created 2016-02-11 16:20:11 | Updated 2016-02-17 05:16:37 |

### Mate unmapped records are identifiable using the 8 SAM flag.

It is possible for a BAM to have multiple types of mate-unmapped records. These mate unmapped records are distinct from mate missing records, where the mate is altogether absent from the BAM. Of the three types of mate unmapped records listed below, we describe only the first two in this dictionary entry.

1. Singly mapping pair.
2. A secondary/supplementary record is flagged as mate-unmapped but the mate is in fact mapped.
3. Both reads in a pair are unmapped.

### (1) Singly mapping pair

A mapped read's unmapped mate is marked in its SAM record in an unexpected manner that allows the pair to sort together. If you look at these unmapped reads, the alignment columns 3 and 4 (reference name and position) indicate they align, in fact identically to the mapped mate. However, what is distinct is the asterisk * in the CIGAR field (column 6) that indicates the record is unmapped. This allows us to (i) identify the unmapped read as having passed through the aligner, and (ii) keep the pairs together in file manipulations that use either coordinate or queryname sorted BAMs. For example, when a genomic interval of reads is taken to create a new BAM, the pair remains together. For file manipulations dependent on such sorting, we can deduce that these mate unmapped records are immune to becoming missing mates.

### (2) Mate unmapped record whose mate is mapped but in a pair that excludes the record

The second type of mate unmapped record applies to multimapping read sets processed through MergeBamAlignment such as in Tutorial#6483. Besides reassigning primary and secondary flags within multimapping sets according to a user specified strategy, MergeBamAlignment marks secondary records with the mate unmapped flag. Specifically, after BWA-MEM alignment, records in multimapping sets are each mate-mapped. After going through MergeBamAlignment, the secondary records become mate-unmapped. The primary alignments remain mate-mapped. This effectively minimizes the association between secondary records and their previous mates.

### How do tools treat them differently?

GATK tools typically exclude secondary/supplementary records from consideration. However, tools will process the mapped read in a singly mapping pair. For example, MarkDuplicates skips secondary records but marks duplicate singly mapping reads.
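The three record types can be told apart from the SAM FLAG bits alone. The sketch below uses the standard bit values from the SAM specification (0x4 read unmapped, 0x8 mate unmapped, 0x100 secondary, 0x800 supplementary); the function name and the returned labels are illustrative, and the classification is a heuristic rather than what any particular tool does.

```python
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def describe_mate_unmapped(flag):
    """Classify a record by its SAM FLAG with respect to the mate-unmapped bit."""
    if not flag & MATE_UNMAPPED:
        return "not a mate-unmapped record"
    if flag & READ_UNMAPPED:
        return "both reads in the pair are unmapped"
    if flag & (SECONDARY | SUPPLEMENTARY):
        return "secondary/supplementary record flagged as mate-unmapped"
    return "mapped read of a singly mapping pair"


print(describe_mate_unmapped(73))   # paired + mate unmapped + first in pair
print(describe_mate_unmapped(329))  # same as 73, plus the secondary bit
print(describe_mate_unmapped(77))   # paired + read unmapped + mate unmapped + first in pair
print(describe_mate_unmapped(133))  # unmapped read whose mate is mapped
```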

Created 2015-10-25 03:53:57 | Updated 2015-10-25 05:08:32 |

Oxidation of guanine to 8-oxoguanine is one of the most common pre-adapter artifacts associated with genomic library preparation, arising from a combination of heat, shearing, and metal contaminants in a sample (doi: 10.1093/nar/gks1443). The 8-oxoguanine base can pair with either cytosine or adenine, ultimately leading to G→T transversion mutations during PCR amplification.

This occurs when a G on the template strand is oxidized, giving it an affinity for binding to A rather than the usual C. Thus, PCR will introduce apparent G>T substitutions in read 1 and C>A in read 2. In the resulting alignments, a given G>T or C>A observation could either be:

1. a true mutation
2. an 8-oxoguanine artifact
3. some other kind of artifact.

The variants (C→A)/(G→T) tend to occur in specific sequence contexts e.g. CCG→CAG (doi:10.1093/nar/gks1443). Although occurring at relatively low frequencies, these artifacts can have profound impacts on variant calling fidelity (doi:10.1093/nar/gks1443).

Created 2015-10-25 04:20:59 | Updated 2016-02-03 20:23:22 |

Illumina sequencers perform an internal quality filtering procedure called the chastity filter, and reads that pass this filter are called PF for pass-filter. According to Illumina, chastity is defined as the ratio of the brightest base intensity to the sum of the brightest and second brightest base intensities. Clusters of reads pass the filter if no more than 1 base call has a chastity value below 0.6 in the first 25 cycles. This filtration process removes the least reliable clusters from the image analysis results.
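The rule above can be written down in a few lines. This is a minimal sketch of the definition as stated here, assuming each cycle is represented by a list of four base intensities; the function names and the intensity values are invented for illustration and do not reflect Illumina's software.

```python
def chastity(intensities):
    """Brightest intensity divided by the sum of the two brightest intensities."""
    brightest, second = sorted(intensities, reverse=True)[:2]
    total = brightest + second
    return brightest / total if total > 0 else 0.0


def passes_filter(cycle_intensities, threshold=0.6, cycles=25, max_failures=1):
    """A cluster passes if at most `max_failures` of the first `cycles` calls fall below `threshold`."""
    failures = sum(
        1 for intensities in cycle_intensities[:cycles]
        if chastity(intensities) < threshold
    )
    return failures <= max_failures


clean = [100, 10, 5, 1]   # chastity of about 0.91
mixed = [50, 45, 3, 2]    # chastity of about 0.53, below the 0.6 threshold

print(passes_filter([clean] * 25))                 # True
print(passes_filter([mixed] + [clean] * 24))       # True, a single low call is tolerated
print(passes_filter([mixed, mixed] + [clean] * 23))  # False
```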

• Illumina, Inc. (2015). Calculating Percent Passing Filter for Patterned and Non-Patterned Flow Cells: A comparison of methods for calculating percent passing filter on Illumina flow cells
• Illumina, Inc. (2014). HiSeq X System user guide

Both articles can be found at http://www.Illumina.com

Created 2015-10-25 03:36:07 | Updated 2016-04-23 01:56:05 |

In paired-end sequencing, the library preparation yields a set of fragments, and the machine sequences each fragment from both ends; for example if you have a 300bp contiguous fragment, the machine will sequence e.g. bases 1-75 (forward direction) and bases 225-300 (reverse direction) of the fragment.

In mate-pair sequencing, the library preparation yields two fragments that are distal to each other in the genome and in the opposite orientation to that of a paired-end fragment.

The three read orientation categories are forward reverse (FR), reverse forward (RF), and reverse-reverse/forward-forward (TANDEM). In general, paired-end reads tend to be in a FR orientation, have relatively small inserts (~300 - 500 bp), and are particularly useful for the sequencing of fragments that contain short repeat regions. Mate-pair fragments are generally in a RF conformation, contain larger inserts (~3 kb), and enable sequence coverage of genomic regions containing large structural rearrangements. Tandem reads can result from inversions and rearrangements during library preparation.

Here is a more illustrative example:

FR: 5' --F--> <--R-- 5' (in slang called "innie" because they point inward)

RF: <--R-- 5' 5' --F--> (in slang called "outie" because they point outward)

TANDEM: 5' --F--> 5' --F--> or <--R-- 5' <--R-- 5'
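The three categories can be derived from the strand and position of each read in a pair. The sketch below is a simplification that assumes both reads map to the same contig and compares leftmost alignment coordinates; the function name and arguments are illustrative and not taken from any specific tool.

```python
def pair_orientation(pos1, reverse1, pos2, reverse2):
    """Classify a read pair as FR, RF or TANDEM.

    pos1/pos2 are leftmost alignment coordinates on the same contig;
    reverse1/reverse2 are True when the read maps to the reverse strand.
    """
    if reverse1 == reverse2:
        return "TANDEM"  # both forward or both reverse
    # The strand of the leftmost read decides between innie and outie.
    leftmost_is_reverse = reverse1 if pos1 <= pos2 else reverse2
    return "RF" if leftmost_is_reverse else "FR"


print(pair_orientation(100, False, 300, True))   # FR (innie)
print(pair_orientation(100, True, 300, False))   # RF (outie)
print(pair_orientation(100, False, 300, False))  # TANDEM
```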

The figure below illustrates this graphically along with the SAM flags that correspond to the FR and RF configurations.

For detailed explanations of library construction strategies (for Illumina sequencers) and how read orientations are determined, please see:

Created 2012-12-18 21:35:34 | Updated 2016-06-29 04:58:17 |

This document explains the concepts involved and how they are applied within the GATK (and Crom+WDL or Queue where applicable). For specific configuration recommendations, see the companion document on parallelizing GATK tools.

## 1. The concept of parallelism

Parallelism is a way to make a program finish faster by performing several operations in parallel, rather than sequentially (i.e. waiting for each operation to finish before starting the next one).

Imagine you need to cook rice for sixty-four people, but your rice cooker can only make enough rice for four people at a time. If you have to cook all the batches of rice sequentially, it's going to take all night. But if you have eight rice cookers that you can use in parallel, you can finish up to eight times faster.

This is a very simple idea but it has a key requirement: you have to be able to break down the job into smaller tasks that can be done independently. It's easy enough to divide portions of rice because rice itself is a collection of discrete units. In contrast, let's look at a case where you can't make that kind of division: it takes one pregnant woman nine months to grow a baby, but you can't do it in one month by having nine women share the work.

The good news is that most GATK runs are more like rice than like babies. Because GATK tools are built to use the Map/Reduce method (see doc for details), most GATK runs essentially consist of a series of many small independent operations that can be parallelized.

Parallelism is a great way to speed up processing on large amounts of data, but it has "overhead" costs. Without getting too technical at this point, let's just say that parallelized jobs need to be managed, you have to set aside memory for them, regulate file access, collect results and so on. So it's important to balance the costs against the benefits, and avoid dividing the overall work into too many small jobs.

Going back to the introductory example, you wouldn't want to use a million tiny rice cookers that each boil a single grain of rice. They would take way too much space on your countertop, and the time it would take to distribute each grain then collect it when it's cooked would negate any benefits from parallelizing in the first place.

### Parallel computing in practice (sort of)

OK, parallelism sounds great (despite the tradeoffs caveat), but how do we get from cooking rice to executing programs? What actually happens in the computer?

Consider that when you run a program like the GATK, you're just telling the computer to execute a set of instructions.

Let's say we have a text file and we want to count the number of lines in it. The set of instructions to do this can be as simple as:

• open the file, count the number of lines in the file, tell us the number, close the file

Note that tell us the number can mean writing it to the console, or storing it somewhere for use later on.

Now let's say we want to know the number of words on each line. The set of instructions would be:

• open the file, read the first line, count the number of words, tell us the number, read the second line, count the number of words, tell us the number, read the third line, count the number of words, tell us the number

And so on until we've read all the lines, and finally we can close the file. It's pretty straightforward, but if our file has a lot of lines, it will take a long time, and it will probably not use all the computing power we have available.

So to parallelize this program and save time, we just cut up this set of instructions into separate subsets like this:

• open the file, index the lines

• read the first line, count the number of words, tell us the number
• read the second line, count the number of words, tell us the number
• read the third line, count the number of words, tell us the number
• [repeat for all lines]

• collect final results and close the file

Here, the read the Nth line steps can be performed in parallel, because they are all independent operations.

You'll notice that we added a step, index the lines. That's a little bit of preliminary work that allows us to perform the read the Nth line steps in parallel (or in any order we want) because it tells us how many lines there are and where to find each one within the file. It makes the whole process much more efficient. As you may know, the GATK requires index files for the main data files (reference, BAMs and VCFs); the reason is essentially to have that indexing step already done.

Anyway, that's the general principle: you transform your linear set of instructions into several subsets of instructions. There's usually one subset that has to be run first and one that has to be run last, but all the subsets in the middle can be run at the same time (in parallel) or in whatever order you want.
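The words-per-line example can be written as a short program with exactly that shape: one preliminary step, a set of independent middle steps, and a final collection step. This is a sketch using Python's standard `concurrent.futures` module, with a toy in-memory text standing in for the file; it illustrates the structure rather than a real speed-up, since counting words in a handful of short lines is far too little work to outweigh the overhead.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(line):
    """The independent unit of work: count the words on one line."""
    return len(line.split())


def words_per_line(lines, max_workers=4):
    """Run count_words on every line in parallel and collect results in line order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(count_words, lines))


# Preliminary step: "open the file, index the lines" (a toy string stands in for the file).
text = "the quick brown fox\njumps over\nthe lazy dog\n"
lines = text.splitlines()

# Middle steps run in parallel; the final step collects the results.
print(words_per_line(lines))  # [4, 2, 3]
```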

## 2. Parallelizing the GATK

There are three different modes of parallelism offered by the GATK, and to really understand the difference you first need to understand the different levels of computing that are involved.

### A quick word about levels of computing

By levels of computing, we mean the computing units in terms of hardware: the core, the machine (or CPU) and the cluster or cloud.

• Core: the level below the machine. On your laptop or desktop, the CPU (central processing unit, or processor) contains one or more cores. If you have a recent machine, your CPU probably has at least two cores, and is therefore called dual-core. If it has four, it's a quad-core, and so on. High-end consumer machines like the latest Mac Pro have up to twelve-core CPUs (which should be called dodeca-core if we follow the Latin terminology) but the CPUs on some professional-grade machines can have tens or hundreds of cores.

• Machine: the middle of the scale. For most of us, the machine is the laptop or desktop computer. Really we should refer to the CPU specifically, since that's the relevant part that does the processing, but the most common usage is to say machine. Except if the machine is part of a cluster, in which case it's called a node.

• Cluster or cloud: the level above the machine. This is a high-performance computing structure made of a bunch of machines (usually called nodes) networked together. If you have access to a cluster, chances are it either belongs to your institution, or your company is renting time on it. A cluster can also be called a server farm or a load-sharing facility.

Parallelism can be applied at all three of these levels, but in different ways of course, and under different names. Parallelism takes the name of multi-threading at the core and machine levels, and scatter-gather at the cluster level.

In computing, a thread of execution is a set of instructions that the program issues to the processor to get work done. In single-threading mode, a program only sends a single thread at a time to the processor and waits for it to be finished before sending another one. In multi-threading mode, the program may send several threads to the processor at the same time.

Not making sense? Let's go back to our earlier example, in which we wanted to count the number of words in each line of our text document. Hopefully it is clear that the first version of our little program (one long set of sequential instructions) is what you would run in single-threaded mode. And the second version (several subsets of instructions) is what you would run in multi-threaded mode, with each subset forming a separate thread. You would send out the first thread, which performs the preliminary work; then once it's done you would send the "middle" threads, which can be run in parallel; then finally once they're all done you would send out the final thread to clean up and collect final results.

If you're still having a hard time visualizing what the different threads are like, just imagine that you're doing cross-stitching. If you're a regular human, you're working with just one hand. You're pulling a needle and thread (a single thread!) through the canvas, making one stitch after another, one row after another. Now try to imagine an octopus doing cross-stitching. He can make several rows of stitches at the same time using a different needle and thread for each. Multi-threading in computers is surprisingly similar to that.

Hey, if you have a better example, let us know in the forum and we'll use that instead.

Alright, now that you understand the idea of multithreading, let's get practical: how do we get the GATK to use multi-threading?

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively. They can be combined, since they act at different levels of computing:

• -nt / --num_threads controls the number of data threads sent to the processor (acting at the machine level)

• -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread (acting at the core level).

Not all GATK tools can use these options due to the nature of the analyses that they perform and how they traverse the data. Even in the case of tools that are used sequentially to perform a multi-step process, the individual tools may not support the same options. For example, at time of writing (Dec. 2012), of the tools involved in local realignment around indels, RealignerTargetCreator supports -nt but not -nct, while IndelRealigner does not support either of these options.

In addition, there are some important technical details that affect how these options can be used with optimal results. Those are explained along with specific recommendations for the main GATK tools in a companion document on parallelizing the GATK.

### Scatter-gather

If you Google it, you'll find that the term scatter-gather can refer to a lot of different things, including strategies to get the best price quotes from online vendors, methods to control memory allocation and… an indie-rock band. What all of those things have in common (except possibly the band) is that they involve breaking up a task into smaller, parallelized tasks (scattering) then collecting and integrating the results (gathering). That should sound really familiar to you by now, since it's the general principle of parallel computing.

So yes, "scatter-gather" is really just another way to say we're parallelizing things. OK, but how is it different from multithreading, and why do we need yet another name?

As you know by now, multithreading specifically refers to what happens internally when the program (in our case, the GATK) sends several sets of instructions to the processor to achieve the instructions that you originally gave it in a single command-line. In contrast, the scatter-gather strategy as used by the GATK involves separate programs. There are two pipelining solutions that we support for scatter-gathering GATK jobs, Crom+WDL and Queue. They are quite different, but both are able to generate separate GATK jobs (each with its own command-line) to achieve the instructions given in a script.

At the simplest level, the script can involve a single GATK tool. In that case, the execution engine (Cromwell or Queue) will create separate GATK commands that will each run that tool on a portion of the input data (= the scatter step). The results of each run will be stored in temporary files. Then once all the runs are done, the engine will collate all the results into the final output files, as if the tool had been run as a single command (= the gather step).

Note that Queue and Cromwell have additional capabilities, such as managing the use of multiple GATK tools in a dependency-aware manner to run complex pipelines, but that is outside the scope of this article. To learn more about pipelining the GATK with Queue, please see the Queue documentation. To learn more about Crom+WDL, see the WDL website.
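To give a feel for what the scatter step produces, here is a toy sketch that builds one GATK 3-style command line per interval and then lists the per-interval outputs in the order they would be gathered. It only constructs the commands and does not run anything. The file names, the choice of chromosomes as scatter units, and the function names are all assumptions; Queue and Cromwell do this work for you and handle many more details.

```python
def scatter_commands(tool, reference, input_bam, intervals):
    """Build one GATK command line per interval (the scatter step)."""
    commands = []
    for interval in intervals:
        shard_output = f"shard.{interval}.vcf"  # hypothetical temporary file name
        commands.append([
            "java", "-jar", "GenomeAnalysisTK.jar",
            "-T", tool,
            "-R", reference,
            "-I", input_bam,
            "-L", interval,
            "-o", shard_output,
        ])
    return commands


def gather_order(commands):
    """List the shard outputs in scatter order, ready to be collated (the gather step)."""
    return [command[command.index("-o") + 1] for command in commands]


commands = scatter_commands(
    "HaplotypeCaller", "reference.fasta", "sample.bam", ["chr1", "chr2", "chr3"]
)
for command in commands:
    print(" ".join(command))
print(gather_order(commands))
```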

### Compare and combine

So you see, scatter-gather is a very different process from multi-threading because the parallelization happens outside of the program itself. The big advantage is that this opens up the upper level of computing: the cluster level. Remember, the GATK program is limited to dispatching threads to the processor of the machine on which it is run – it cannot by itself send threads to a different machine. But an execution engine like Queue or Cromwell can dispatch scattered GATK jobs to different machines in a computing cluster or on a cloud platform by interfacing with the appropriate job management software.

That being said, multithreading has the great advantage that all the cores on a machine have access to that machine's shared memory, with very high bandwidth. In contrast, the multiple machines on a network used for scatter-gather are fundamentally limited by the cost of moving data over the network.

The good news is that you can combine scatter-gather and multithreading: use Queue or Cromwell to scatter GATK jobs to different nodes on your cluster or cloud platform, then use the GATK's internal multithreading capabilities to parallelize the jobs running on each node.

Going back to the rice-cooking example, it's as if instead of cooking the rice yourself, you hired a catering company to do it for you. The company assigns the work to several people, who each have their own cooking station with multiple rice cookers. Now you can feed a lot more people in the same amount of time! And you don't even have to clean the dishes.

Created 2016-05-27 22:53:16 | Updated |

A pedigree is a structured description of the familial relationships between samples.

Some GATK tools are capable of incorporating pedigree information in the analysis they perform if provided in the form of a PED file through the --pedigree (or -ped) argument.

### PED file format

The PED file is a whitespace-delimited (space or tab) file. The first six columns are mandatory:

• Family ID
• Individual ID
• Paternal ID
• Maternal ID
• Sex (1=male; 2=female; other=unknown)
• Phenotype

The IDs are alphanumeric: the combination of family and individual ID should uniquely identify a person. If an individual's sex is unknown, then any character other than 1 or 2 can be used in the fifth column.

A PED file must have 1 and only 1 phenotype in the sixth column. The phenotype can be either a quantitative trait or an "affected status" column: GATK will automatically detect which type (i.e. based on whether a value other than 0, 1, 2 or the missing genotype code is observed).

Affected status should be coded as follows:

• -9 missing
• 0 missing
• 1 unaffected
• 2 affected

If any value other than -9, 0, 1 or 2 is detected, the samples are assumed to carry phenotype values, which are interpreted as strings.

Note that genotypes (column 7 onwards) cannot be specified to the GATK.

You can add a comment to a PED or MAP file by starting the line with a # character. The rest of that line will be ignored, so make sure none of the IDs start with this character.

Each -ped argument can be tagged with NO_FAMILY_ID, NO_PARENTS, NO_SEX, NO_PHENOTYPE to tell the GATK PED parser that the corresponding fields are missing from the ped file.

#### Example

Here are two individuals (one row = one person):

FAM001  1  0 0  1  2
FAM001  2  0 0  1  2
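To make the column layout concrete, here is a minimal sketch of a PED line parser in Python. It is not the GATK PED parser: the names `PedRecord`, `parse_ped_line` and `is_affected_status` are hypothetical, the tagging options (NO_FAMILY_ID and so on) are not handled, and rejecting lines with more than six columns is an assumption based on the note that genotype columns are not supported.

```python
from typing import NamedTuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str        # "male", "female" or "unknown"
    phenotype: str  # raw value from the sixth column


SEX_CODES = {"1": "male", "2": "female"}
AFFECTED_CODES = {"-9": "missing", "0": "missing", "1": "unaffected", "2": "affected"}


def parse_ped_line(line):
    """Parse one PED line; return None for blank lines and # comments."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    fields = stripped.split()  # splits on any run of spaces or tabs
    if len(fields) != 6:
        # Assumption: exactly six columns, since genotype columns are unsupported.
        raise ValueError(f"expected 6 columns, found {len(fields)}")
    family, individual, father, mother, sex, phenotype = fields
    # Any sex code other than 1 or 2 is treated as unknown.
    return PedRecord(family, individual, father, mother,
                     SEX_CODES.get(sex, "unknown"), phenotype)


def is_affected_status(records):
    """True if every phenotype is one of the affected-status codes."""
    return all(record.phenotype in AFFECTED_CODES for record in records)


if __name__ == "__main__":
    lines = ["# toy pedigree", "FAM001  1  0 0  1  2", "FAM001  2  0 0  1  2"]
    records = [r for r in map(parse_ped_line, lines) if r is not None]
    for record in records:
        print(record, AFFECTED_CODES.get(record.phenotype, record.phenotype))
```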


Created 2014-06-05 16:10:25 | Updated 2016-07-19 21:24:33 |

You may have noticed that a lot of the scores that are output by the GATK are in Phred scale. The Phred scale was originally used to represent base quality scores emitted by the Phred program in the early days of the Human Genome Project (see this Wikipedia article for more historical background). Now they are widely used to represent probabilities and confidence scores in other contexts of genome science.

### Phred scale in context

In the context of sequencing, Phred-scaled quality scores are used to represent how confident we are in the assignment of each base call by the sequencer.

In the context of variant calling, Phred-scaled quality scores can be used to represent many types of probabilities. The most commonly used in GATK is the QUAL score, or variant quality score. It is used in much the same way as the base quality score: the variant quality score is a Phred-scaled estimate of how confident we are that the variant caller correctly identified that a given genome position displays variation in at least one sample.

### Phred scale in practice

In today’s sequencing output, by convention, most useable Phred-scaled base quality scores range from 2 to 40, with some variations in the range depending on the origin of the sequence data (see the FASTQ format documentation for details). However, Phred-scaled quality scores in general can range anywhere from 0 to infinity. A higher score indicates a higher probability that a particular decision is correct, while conversely, a lower score indicates a higher probability that the decision is incorrect.

The Phred quality score (Q) is logarithmically related to the error probability (E).

$$Q = -10 \log_{10} E$$

So we can interpret this score as an estimate of error, where the error is e.g. the probability that the base is called incorrectly by the sequencer, but we can also interpret it as an estimate of accuracy, where the accuracy is e.g. the probability that the base was identified correctly by the sequencer. Depending on how we decide to express it, we can make the following calculations:

If we want the probability of error (E), we take:

$$E = 10 ^{-\left(\frac{Q}{10}\right)}$$

And conversely, if we want to express this as the estimate of accuracy (A), we simply take

$$\begin{eqnarray} A &=& 1 - E \nonumber \\ &=& 1 - 10 ^{-\left(\frac{Q}{10}\right)} \nonumber \end{eqnarray}$$

Here is a table of how to interpret a range of Phred Quality Scores. It is largely adapted from the Wikipedia page for Phred Quality Score.

For many purposes, a Phred Score of 20 or above is acceptable, because this means that whatever it qualifies is 99% accurate, with a 1% chance of error.

| Phred Quality Score | Error | Accuracy (1 - Error) |
|---|---|---|
| 10 | 1/10 = 10% | 90% |
| 20 | 1/100 = 1% | 99% |
| 30 | 1/1000 = 0.1% | 99.9% |
| 40 | 1/10000 = 0.01% | 99.99% |
| 50 | 1/100000 = 0.001% | 99.999% |
| 60 | 1/1000000 = 0.0001% | 99.9999% |
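The conversions above are easy to check numerically. The following is a small Python sketch of the formulas for Q, E and A given earlier; the function names are illustrative and not part of any GATK API.

```python
import math


def phred_to_error(q):
    """Error probability E = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q):
    """Accuracy A = 1 - E."""
    return 1 - phred_to_error(q)


def error_to_phred(e):
    """Phred score Q = -10 * log10(E), defined for 0 < E <= 1."""
    if not 0 < e <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(e)


if __name__ == "__main__":
    # Print error and accuracy for the scores listed in the table.
    for q in (10, 20, 30, 40, 50, 60):
        print(f"Q{q}: error={phred_to_error(q):.6%} accuracy={phred_to_accuracy(q):.6%}")
```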

And finally, here is a graphical representation of the Phred scores showing their relationship to accuracy and error probabilities.

The red line shows the error, and the blue line shows the accuracy. Of course, as error decreases, accuracy increases symmetrically.

Note: You can see that below Q20 (which is how we usually refer to a Phred score of 20), the curve is really steep, meaning that as the Phred score decreases, you lose confidence very rapidly. In contrast, above Q20, both of the graphs level out. This is why Q20 is a good cutoff score for many basic purposes.

Created 2015-10-25 05:09:13 | Updated 2016-01-06 21:19:49 |

Various sources of error affect the hybrid selection (HS) process. Pre-adapter artifacts are those that arise in the preparation step(s) prior to the ligation of the PCR adapters. These artifacts occur on the original template strand, before the addition of adapters, so they correlate with read number orientation in a specific way.

A classic example is the shearing of target genomic DNA leading to oxidation of an amine of guanine at position 8, producing 8-oxoguanine (8-OxoG, OxoG) (doi:10.1093/nar/gks1443) (see also the OxoG entry in this dictionary).

Created 2015-11-20 19:22:28 | Updated 2016-05-20 19:43:39 |

There is no formal definition of what is a read group, but in practice, this term refers to a set of reads that were generated from a single run of a sequencing instrument.

In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group.

Read groups are identified in the SAM/BAM/CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups.

To see the read group information for a BAM file, use the following command.

samtools view -H sample.bam | grep '@RG'

This prints the lines starting with @RG within the header, e.g. as shown in the example below.

@RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI
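For scripted checks, an @RG header line can be split into its tags with a few lines of Python. This is a sketch rather than a replacement for a SAM library: it assumes the tab-delimited layout of real SAM headers (the example above is shown with spaces), and the `REQUIRED_TAGS` tuple is an assumption drawn from the fields described in this entry, not an authoritative list.

```python
# Assumption: the fields this entry describes as needed by GATK, minus the optional PU.
REQUIRED_TAGS = ("ID", "SM", "PL", "LB")


def parse_rg_line(line):
    """Turn one tab-delimited @RG header line into a {tag: value} dict."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    tags = {}
    for field in fields[1:]:
        # Split on the first colon only; values such as DT contain colons.
        tag, _, value = field.partition(":")
        tags[tag] = value
    return tags


def missing_tags(tags):
    """List the required tags that are absent from a parsed @RG line."""
    return [tag for tag in REQUIRED_TAGS if tag not in tags]


if __name__ == "__main__":
    example = ("@RG\tID:H0164.2\tPL:illumina\tPU:H0164ALXX140820.2\tLB:Solexa-272222"
               "\tPI:0\tDT:2014-08-20T00:00:00-0400\tSM:NA12878\tCN:BI")
    tags = parse_rg_line(example)
    print(tags)
    print("missing:", missing_tags(tags))
```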

### Meaning of the read group fields required by GATK

• ID = Read group identifier This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world. Use for BQSR: ID is the lowest denominator that differentiates factors contributing to technical batch effects: therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since all reads in a read group are assumed to share the same error model.

• PU = Platform Unit The PU holds three types of information: {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. The PU is not required by GATK, but if it is present it takes precedence over ID for base recalibration. In the example shown earlier, two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.

• SM = Sample The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name.

• PL = Platform/technology used to produce the read This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.

• LB = DNA preparation library identifier MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

### Deriving ID and PU fields from read names

Here we illustrate how to derive both ID and PU fields from read names as they are formed in the data produced by the Broad Genomic Services pipelines (other sequence providers may use different naming conventions). We break down the common portion of two different read names from a sample file. The unique portions of the read names, which come after the flow cell lane and are separated by colons, are the tile number, the x-coordinate of the cluster and the y-coordinate of the cluster.

H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288

Breaking down the common portion of the query names:

H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane
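The breakdown above can be expressed as a short Python sketch. It follows the naming convention illustrated in this entry only: the function name is hypothetical, and the five-character prefix used for the ID is an assumption taken from the example (H0164), so other providers' read names will need different rules.

```python
def derive_id_and_pu(read_name, id_prefix_len=5):
    """Derive (ID, PU) from a read name shaped like the examples above.

    Assumes colon-separated fields where the first field is the full
    flow cell string and the second field is the lane.
    """
    parts = read_name.split(":")
    if len(parts) < 2:
        raise ValueError("read name does not contain a flow cell and a lane")
    flowcell, lane = parts[0], parts[1]
    # ID uses only the leading portion of the first field plus the lane.
    rg_id = f"{flowcell[:id_prefix_len]}.{lane}"
    # PU uses the whole first field plus the lane.
    pu = f"{flowcell}.{lane}"
    return rg_id, pu


if __name__ == "__main__":
    for name in ("H0164ALXX140820:2:1101:10003:23460",
                 "H0164ALXX140820:2:1101:15118:25288"):
        print(name, "->", derive_id_and_pu(name))
```

Both read names yield the same pair, which matches the ID and PU values in the @RG line shown earlier in this entry.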

### Multi-sample and multiplexed example

Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

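Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400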

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship between read groups (unique for each lane), libraries (each sequenced on two lanes) and samples (each spread across four lanes, two lanes for each library).
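This hierarchy can be recovered directly from the header lines. The sketch below groups the Mom and Kid read groups listed above by sample and then by library; it assumes tab-delimited @RG lines, and `build_hierarchy` is an illustrative helper, not a GATK or Picard function.

```python
from collections import defaultdict

RG_LINES = [
    "@RG\tID:FLOWCELL1.LANE5\tPL:ILLUMINA\tLB:LIB-MOM-1\tSM:MOM\tPI:200",
    "@RG\tID:FLOWCELL1.LANE6\tPL:ILLUMINA\tLB:LIB-MOM-1\tSM:MOM\tPI:200",
    "@RG\tID:FLOWCELL1.LANE7\tPL:ILLUMINA\tLB:LIB-MOM-2\tSM:MOM\tPI:400",
    "@RG\tID:FLOWCELL1.LANE8\tPL:ILLUMINA\tLB:LIB-MOM-2\tSM:MOM\tPI:400",
    "@RG\tID:FLOWCELL2.LANE1\tPL:ILLUMINA\tLB:LIB-KID-1\tSM:KID\tPI:200",
    "@RG\tID:FLOWCELL2.LANE2\tPL:ILLUMINA\tLB:LIB-KID-1\tSM:KID\tPI:200",
    "@RG\tID:FLOWCELL2.LANE3\tPL:ILLUMINA\tLB:LIB-KID-2\tSM:KID\tPI:400",
    "@RG\tID:FLOWCELL2.LANE4\tPL:ILLUMINA\tLB:LIB-KID-2\tSM:KID\tPI:400",
]


def build_hierarchy(rg_lines):
    """Return {sample: {library: [read group IDs]}} from @RG header lines."""
    tree = defaultdict(lambda: defaultdict(list))
    for line in rg_lines:
        # Skip the leading "@RG" and split each TAG:value pair on the first colon.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        tree[tags["SM"]][tags["LB"]].append(tags["ID"])
    return tree


if __name__ == "__main__":
    for sample, libraries in build_hierarchy(RG_LINES).items():
        print(sample)
        for library, read_groups in libraries.items():
            print("  ", library, read_groups)
```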

Created 2016-06-22 19:13:25 | Updated 2017-01-13 17:11:41 |

#### Document is in BETA. It may be incomplete and/or inaccurate. Post suggestions to the Comments section.

This document defines several components of a reference genome. We use the human GRCh38/hg38 assembly to illustrate.

GRCh38/hg38 is the assembly of the human genome released in December 2013 that uses alternate or ALT contigs to represent common complex variation, including HLA loci. Alternate contigs are also present in past assemblies but not to the extent we see with GRCh38. Many of the improvements in GRCh38 are the result of other genome sequencing and analysis projects, including the 1000 Genomes Project.

The ideogram is from the Genome Reference Consortium website and showcases GRCh38.p7. The zoomed region illustrates how regions in blue are full of Ns.

Analysis set reference genomes have special features to accommodate sequence read alignment. This type of genome reference can differ from the reference you use to browse the genome.

• For example, the GRCh38 analysis set hard-masks, i.e. replaces with Ns, a proportion of homologous centromeric and genomic repeat arrays (on chromosomes 5, 14, 19, 21, & 22) and two PAR (pseudoautosomal) regions on chromosome Y. Confirm the set you are using by viewing a PAR region of the Y chromosome on IGV as shown in the figure below. The chrY locations of PAR1 and PAR2 on GRCh38 are chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415. The sequence in the reference set is a mix of uppercase and lowercase letters. The lowercase letters represent soft-masked sequence corresponding to repeats from RepeatMasker and Tandem Repeats Finder.
• The GRCh38 analysis sets also include a contig to siphon off reads corresponding to the Epstein-Barr virus sequence as well as decoy contigs. The EBV contig can help correct for artifacts stemming from immortalization of human blood lymphocytes with EBV transformation, as well as capture endogenous EBV sequence as EBV naturally infects B cells in ~90% of the world population. Heng Li provides the decoy contigs.

## Nomenclature: words to describe components of reference genomes

• A contig is a contiguous sequence without gaps.

• Alternate contigs, alternate scaffolds or alternate loci allow for representation of diverging haplotypes. These regions are too complex for a single representation. Identify ALT contigs by their _alt suffix.

The GRCh38 ALT contigs total 109Mb in length and span 60Mb of the primary assembly. Alternate contig sequences can range from novel or highly diverged to nearly identical to the corresponding primary assembly sequence. Sequences that are highly diverged from the primary assembly only contribute a few million bases. Most subsequences of ALT contigs are fairly similar to the primary assembly. This means that if we align sequence reads to GRCh38+ALT blindly, then we obtain many multi-mapping reads with zero mapping quality. Since many GATK tools have a ZeroMappingQuality filter, we will then miss variants corresponding to such loci.

• Primary assembly refers to the collection of (i) assembled chromosomes, (ii) unlocalized and (iii) unplaced sequences. It represents a non-redundant haploid genome.

(i) Assembled chromosomes for hg38 are chromosomes 1–22 (chr1–chr22), X (chrX), Y (chrY) and Mitochondrial (chrM). (ii) Unlocalized sequences are on a specific chromosome but with unknown order or orientation. Identify them by their _random suffix. (iii) Unplaced sequences are on an unknown chromosome. Identify them by their chrUn_ prefix.

• PAR stands for pseudoautosomal region. PAR regions in mammalian X and Y chromosomes allow for recombination between the sex chromosomes. Because the PAR sequences together create a diploid or pseudo-autosomal sequence region, the X and Y chromosome sequences are intentionally identical in the genome assembly. Analysis set genomes further hard-mask two of the Y chromosome PAR regions so as to allow mapping of reads solely to the X chromosome PAR regions.

• Different assemblies shift coordinates for loci and are released infrequently. Hg19 and hg38 represent two different major assemblies. Comparing data from different assemblies requires lift-over tools that adjust genomic coordinates to match loci, at times imperfectly. In the special case of hg19 and GRCh37, the primary assembly coordinates are identical for loci but patch updates differ. Also, the naming conventions of the references differ, e.g. the use of chr1 versus 1 to indicate chromosome 1, such that these also require lift-over to compare data. GRCh38/hg38 unifies the assemblies and the naming conventions.

• Patches are regional fixes that are released periodically for a given assembly. GRCh38.p7 indicates the seventh patched minor release of GRCh38. This NCBI page explains in more detail. Patches add information to the assembly and improve representation without disrupting the chromosome coordinates. The two types of patches, fix and novel, represent different types of sequence.

(i) Fix patches represent sequences that will replace primary assembly sequence in the next major assembly release. When interpreting data, fix patches should take precedence over the chromosomes. (ii) Novel patches represent alternate loci. When interpreting data, treat novel patches as population sequence variants.
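The naming conventions in the nomenclature list above lend themselves to a simple classifier. The following Python sketch sorts hg38-style contig names into the categories described; `classify_contig` is a hypothetical helper, the example names are illustrative, and it relies on hg38 naming unplaced contigs with a chrUn_ prefix. Anything it does not recognize (decoys, EBV, HLA contigs) falls into "other".

```python
# chr1..chr22 plus the sex chromosomes and the mitochondrial genome.
ASSEMBLED_CHROMOSOMES = {f"chr{n}" for n in range(1, 23)} | {"chrX", "chrY", "chrM"}


def classify_contig(name):
    """Classify an hg38-style contig name by its prefix or suffix."""
    if name in ASSEMBLED_CHROMOSOMES:
        return "assembled chromosome"
    if name.endswith("_alt"):
        return "alternate contig"
    if name.endswith("_random"):
        return "unlocalized sequence"
    if name.startswith("chrUn_"):
        return "unplaced sequence"
    return "other"


def is_primary_assembly(name):
    """Primary assembly = assembled chromosomes + unlocalized + unplaced."""
    return classify_contig(name) in {
        "assembled chromosome", "unlocalized sequence", "unplaced sequence"}


if __name__ == "__main__":
    for contig in ("chr14", "chr1_KI270706v1_random", "chrUn_KI270302v1",
                   "chr6_GL000250v2_alt", "chrEBV", "HLA-A*01:01:01:01"):
        print(f"{contig}: {classify_contig(contig)}")
```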

## The GATK perspective on reference genomes

Within GATK documentation, Tutorial#8017 outlines how to map reads in an alternate contig aware manner and discusses some of the implications of mapping reads to reference genomes with alternate contigs.

GATK tools allow for use of a genomic intervals list that tells tools which regions of the genome the tools should act on. Judicious use of an intervals list, e.g. one that excludes regions of Ns and low complexity repeat regions in the genome, makes processes more efficient. This brings us to the next point.

#### Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.

• For example, HLA-A*01:01:01:01 is a new contig in GRCh38. The colons are a new feature of contig naming for GRCh38 from prior assemblies. This has implications for using the -L option of GATK as the option also uses the colon as a delimiter to distinguish between contig and genomic coordinates.
• When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use -L chr1:1-100. This also works for our HLA contig, e.g. -L HLA-A*01:01:01:01:1-100.
• However, when passing in an entire contig, for contigs with colons in the name, you must add :1+ to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.

 -L HLA-A*01:01:01:01:1+
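To see why the trailing :1+ matters, consider a deliberately naive interval parser that treats everything after the last colon as coordinates. This is a hypothetical illustration in Python, not the GATK's actual parsing code.

```python
def naive_parse_interval(arg):
    """Split an interval string on its LAST colon into (contig, coordinates).

    Returns (arg, None) when there is no colon at all.
    """
    if ":" not in arg:
        return arg, None
    contig, coordinates = arg.rsplit(":", 1)
    return contig, coordinates


if __name__ == "__main__":
    for arg in ("chr1:1-100",
                "HLA-A*01:01:01:01:1-100",
                "HLA-A*01:01:01:01",      # whole contig, no workaround
                "HLA-A*01:01:01:01:1+"):  # whole contig, with workaround
        print(f"{arg!r} -> {naive_parse_interval(arg)}")
```

Without the suffix, the final :01 of the contig name is mistaken for a coordinate and the contig name is truncated. With :1+ appended, the last colon really does separate the full contig name from the coordinates.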

### Viewing CRAM alignments on genome browsers

Because CRAM compression depends on the alignment reference genome, tools that use CRAM files ensure correct decompression by comparing reference contig MD5 hash values. These are sensitive to any changes in the sequence, e.g. masking with Ns. This can have implications for viewing alignments in genome browsers when there is a mismatch between the reference that is loaded in the browser and the reference that was used in alignment. If you are using a version of tools for which this is an issue, be sure to load the original analysis set reference genome to view the CRAM alignments.
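A short Python sketch shows how sensitive these checksums are. It assumes the normalization used for reference MD5s in the SAM specification (whitespace removed, letters uppercased before hashing); the sequences are toy examples and `contig_md5` is an illustrative helper.

```python
import hashlib


def contig_md5(sequence):
    """MD5 of a contig sequence after stripping whitespace and uppercasing."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


if __name__ == "__main__":
    original = "ACGTACGTACGT"
    soft_masked = "acgtACGTACGT"  # lowercase = soft-masked repeats
    hard_masked = "NNNNACGTACGT"  # Ns = hard-masked sequence

    print("original   ", contig_md5(original))
    print("soft-masked", contig_md5(soft_masked))
    print("hard-masked", contig_md5(hard_masked))
```

Under this assumption, soft-masking leaves the checksum unchanged because case is normalized away, whereas hard-masking with Ns changes the sequence itself and therefore the checksum.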

### Should I switch to a newer reference?

Yes you should. In addition to adding many alternate contigs, GRCh38 corrects thousands of SNPs and indels in the GRCh37 assembly that are absent in the population and are likely sequencing artifacts. It also includes synthetic centromeric sequence and updates non-nuclear genomic sequence.

The ability to recognize alternate haplotypes for loci is a drastic improvement that GRCh38 makes possible. Going forward, expanding genomics data will help identify variants for alternate haplotypes, improve existing and add additional alternate haplotypes and give us a better accounting of alternate haplotypes within populations. We are already seeing improvements and additions in the patch releases to reference genomes, e.g. the seven minor releases of GRCh38 available at the time of this writing.

Note that variants produced by alternate haplotypes when they are represented on the primary assembly may or may not be present in data resources, e.g. dbSNP. This could have varying degrees of impact, including negligible, for any process that relies on known variant sites. Consider the impact this discrepant coverage in data resources may have for your research aims and weigh this against the impact of missing variants because their sequence context is unaccounted for in previous assemblies.

## External resources

1. New 11/16/2016 For a brief history and discussion on challenges in using GRCh38, see the 2015 Genome Biology article Extending reference assembly models by Church et al. (DOI: 10.1186/s13059-015-0587-3).
2. For press releases highlighting improvements in GRCh38 from December 2013, see http://www.ncbi.nlm.nih.gov/news/12-23-2013-grch38-released/ and http://genomeref.blogspot.co.uk/2013/12/announcing-grch38.html. The latter post summarizes major improvements, including the correction of thousands of SNPs and indels in GRCh37 not seen in the population and the inclusion of synthetic centromeric sequence.
3. Recent releases of BWA, e.g. v0.7.15+, handle alt contig mapping and HLA typing. See the BWA repository for information. See these pages for download and installation instructions.
4. The Genome Reference Consortium (GRC) provides human, mouse, zebrafish and chicken sequences, and this particular webpage gives an overview of GRCh38. Namely, an interactive chromosome ideogram marks regions with corresponding alternate loci, regions with fix patches and regions containing novel patches. For additional assembly terminology, see http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/definitions.shtml.
5. The UCSC Genome Browser allows browsing and download of genomes, including analysis sets, from many different species. For more details on the difference between GRCh38 reference and analysis sets, see ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/README.txt and ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/README.txt, respectively. In addition, the site provides annotation files, e.g. here is the annotation database for GRCh38. Within this particular page, the file named gap.txt.gz catalogues the gapped regions of the assembly full of Ns. For our illustration above, the corresponding region in this file shows:

    585    chr14    0           10000       1    N    10000       telomere     no
    1      chr14    10000       16000000    2    N    15990000    short_arm    no
    707    chr14    16022537    16022637    4    N    100         contig       no
6. The Integrative Genomics Viewer is a desktop application for viewing genomics data including alignments. The tool accesses reference genomes you provide via file or URL or that it hosts over a server. The numerous hosted reference genomes include GRCh38. See this page for information on hosted reference genomes. For the most up-to-date list of hosted genomes, open IGV and go to Genomes>Load Genome From Server. A menu lists genomes you can make available in the main genome dropdown menu.

Created 2016-02-03 17:28:43 | Updated 2016-02-10 19:03:10 |

We use the term spanning deletion or overlapping deletion to refer to a deletion that spans a position of interest.

The presence of a spanning deletion affects how we can represent genotypes at any site(s) that it spans for those samples that carry the deletion, whether in heterozygous or homozygous variant form. Page 8, item 5 of the VCF v4.3 specification reserves the * allele to reference overlapping deletions. This is not to be confused with the bracketed asterisk <*> used to denote symbolic alternate alleles.

Here we illustrate with four human samples. Bob and Lian each have a heterozygous A to T single nucleotide polymorphism at position 20, our position of interest. Kyra has a 9 bp deletion from position 15 to 23 on both homologous chromosomes that extends across position 20. Lian and Omar are each heterozygous for the same 9 bp deletion. Omar and Bob's other allele is the reference A.

What are the genotypes for each individual at position 20? For Bob, the reference A and variant T alleles are clearly present for a genotype of A/T.

What about Lian? Lian has a variant T allele plus a 9 bp deletion overlapping position 20. To notate the deletion as we do single nucleotide deletions is technically inaccurate. We need a placeholder notation to signify absent sequence that extends beyond the position of interest and that is listed for an earlier position, in our case position 14. The solution is to use a star or asterisk * at position 20 to refer to the spanning deletion. Using this convention, Lian's genotype is T/*.

At the sample-level, Kyra and Omar would not have records for position 20. However, we are comparing multiple samples and so we indicate the spanning deletion at position 20 with *. Omar's genotype is A/* and Kyra's is */*.
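The genotypes worked out above can be reproduced with a small Python sketch. The data model is an assumption made for illustration: each haplotype is a pair of the base it carries at position 20 (or None if that base is deleted) and an optional deletion given as inclusive start and end coordinates. It is only meant to be queried at position 20.

```python
DELETION = (15, 23)  # the 9 bp deletion spanning positions 15 to 23

# Two haplotypes per sample: (base at position 20, deletion or None).
SAMPLES = {
    "Bob":  [("A", None), ("T", None)],
    "Lian": [("T", None), (None, DELETION)],
    "Kyra": [(None, DELETION), (None, DELETION)],
    "Omar": [("A", None), (None, DELETION)],
}


def allele_at(haplotype, position):
    """Return '*' if a deletion on this haplotype spans the position, else its base."""
    base, deletion = haplotype
    if deletion is not None and deletion[0] <= position <= deletion[1]:
        return "*"
    return base


def genotype_at(haplotypes, position):
    """Join the alleles of both haplotypes into a genotype string."""
    return "/".join(allele_at(haplotype, position) for haplotype in haplotypes)


if __name__ == "__main__":
    for sample, haplotypes in SAMPLES.items():
        print(sample, genotype_at(haplotypes, 20))
```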

In the VCF, depending on the format used by tools, positions equivalent to our example position 20 may or may not be listed. If listed, such as in the first example VCF shown, the spanning deletion is noted with the asterisk * under the ALT column. The spanning deletion is then referred to in the genotype GT for Kyra, Lian and Omar. Alternatively, a VCF may altogether avoid referencing the spanning deletion by listing the variant with the spanning deletion together with the deletion. This is shown in the second example VCF at position 14.