The is the obligatory first phase that must precede all variant discovery. It involves pre-processing the raw sequence data (provided in FASTQ or uBAM format) to produce analysis-ready BAM files. This involves alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.
|Prod* germline short variant per-sample calling||uBAM to GVCF||optimized for GCP||yes||pending||hg38 & b37|
|$5 Genome Analysis Pipeline||uBAM to GVCF or cohort VCF||optimized for GCP (see blog)||yes||hg38||yes|
|Generic data pre-processing||uBAM to analysis-ready BAM||universal||yes||hg38 & b37||hg38 & b37|
* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.
This workflow is designed to operate on individual samples, for which the data is initially organized in distinct subsets called readgroups. These correspond to the intersection of libraries (the DNA product extracted from biological samples and prepared for sequencing, which includes fragmenting and tagging with identifying barcodes) and lanes (units of physical separation on the DNA sequencing chips) generated through multiplexing (the process of mixing multiple libraries and sequencing them on multiple lanes, for risk and artifact mitigation purposes).
Our reference implementations expect the read data to be input in unmapped BAM (uBAM) format. Conversion utilities are available to convert from FASTQ to uBAM.
We begin by mapping the sequence reads to the reference genome to produce a file in SAM/BAM format sorted by coordinate. Next, we mark duplicates to mitigate biases introduced by data generation steps such as PCR amplification. Finally, we recalibrate the base quality scores, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read.
This first processing step is performed per-read group and consists of mapping each individual read pair to the reference genome, which is a synthetic single-stranded representation of common genome sequence that is intended to provide a common coordinate framework for all genomic analysis. Because the mapping algorithm processes each read pair in isolation, this can be massively parallelized to increase throughput as desired.
This second processing step is performed per-sample and consists of identifying read pairs that are likely to have originated from duplicates of the same original DNA fragments through some artifactual processes. These are considered to be non-independent observations, so the program tags all but a single read pair within each set of duplicates, causing the marked pairs to be ignored by default during the variant discovery process. At this stage the reads also need to be sorted into coordinate-order for the next step of the pre-processing. MarkDuplicatesSpark performs both the duplicate marking step and the sort step for this stage of pre-processing. This phase of the pipeline has historically been a performance bottleneck due to the large number of comparisons made between read pairs in a sample so MarkDuplicatesSpark utilizes Apache Spark in order to parallelize the process to better take advantage all available resources. This tool can be run locally even without access to a dedicated Spark cluster.
As an alternative to MarkDuplicatesSpark this step can be performed by using the Picard implementation of MarkDuplicates to perform the duplicate marking followed by SortSam to sort the reads. Both of these tools are currently implemented as single threaded tools and consequently cannot take advantage of core parallelism. It is advised to run MarkDuplicatesSpark over MarkDuplicates followed by SortSam on machines with a high number of cores or on machines with access to fast disk drives. MarkDuplicatesSpark produces bit-wise equivalent output to this approach if run following the best practices so either set of tools is valid.
This third processing step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.