Data pre-processing for variant discovery
Best Practices Workflows | Created 2018-01-09 | Last updated 2018-01-09

Comments (0)


The is the obligatory first phase that must precede all variant discovery. It involves pre-processing the raw sequence data (provided in FASTQ or uBAM format) to produce analysis-ready BAM files. This involves alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.

Reference Implementations

Pipeline Summary Notes Github FireCloud
Prod* germline short variant per-sample calling uBAM to GVCF optimized for GCP :) TBD
Generic data pre-processing uBAM to analysis-ready BAM universal :) TBD

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.

Expected input

This workflow is designed to operate on individual samples, for which the data is initially organized in distinct subsets called readgroups. These correspond to the intersection of libraries (the DNA product extracted from biological samples and prepared for sequencing, which includes fragmenting and tagging with identifying barcodes) and lanes (units of physical separation on the DNA sequencing chips) generated through multiplexing (the process of mixing multiple libraries and sequencing them on multiple lanes, for risk and artifact mitigation purposes).

Our reference implementations expect the read data to be input in unmapped BAM (uBAM) format. Conversion utilities are available to convert from FASTQ to uBAM.

Main steps

We begin by mapping the sequence reads to the reference genome to produce a file in SAM/BAM format sorted by coordinate. Next, we mark duplicates to mitigate biases introduced by data generation steps such as PCR amplification. Finally, we recalibrate the base quality scores, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read.

Map to Reference

Tools involved: BWA, MergeBamAlignments

This first processing step is performed per-read group and consists of mapping each individual read pair to the reference genome, which is a synthetic single-stranded representation of common genome sequence that is intended to provide a common coordinate framework for all genomic analysis. Because the mapping algorithm processes each read pair in isolation, this can be massively parallelized to increase throughput as desired.

Mark Duplicates

Tools involved: MarkDuplicates, SortSam

This second processing step is performed per-sample and consists of identifying read pairs that are likely to have originated from duplicates of the same original DNA fragments through some artifactual processes. These are considered to be non-independent observations, so the program tags all but of the read pairs within each set of duplicates, causing them to be ignored by default during the variant discovery process. This step constitutes a major bottleneck since it involves making a large number of comparisons between all the read pairs belonging to the sample, across all of its readgroups. It is followed by a sorting operation (not explicitly shown in the workflow diagram) that also constitutes a performance bottleneck, since it also operates across all reads belonging to the sample. Both algorithms continue to be the target of optimization efforts to reduce their impact on latency.

Base (Quality Score) Recalibration

Tools involved: BaseRecalibrator, Apply Recalibration, AnalyzeCovariates (optional)

This third processing step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.

Return to top Comment on this article