By Tom White & James Emery

With GATK 4.1 we continue to improve our support for users who want to run GATK tools on Apache Spark. Spark is a popular engine for large-scale data processing that can be deployed on a cluster or on a standalone machine with multiple cores, a good way to speed up GATK tools without having to invest in a cluster.

The newest release introduces MarkDuplicatesSpark, a newly out-of-beta replacement for Picard MarkDuplicates that has been redesigned to take advantage of multicore parallelism and to run on cloud platforms. When run on large multicore machines on Google Compute Engine, the new and improved version is cost-competitive with Picard MarkDuplicates + SortSam while being significantly faster overall. Other Spark improvements include the full ReadsPipelineSpark, powered by a brand-new Spark I/O library, Disq.


MarkDuplicatesSpark is a replacement for the old MarkDuplicates + SortSam steps of the Best Practices pipeline, neither of which takes advantage of multicore parallelism. For 4.1 we redesigned MarkDuplicatesSpark to 1) achieve good parallelism with up to 16 cores on a single machine and 2) limit the number of Spark shuffles performed and reduce the size of the data that must be parallelized. One challenge with the previous pipeline was that we often couldn't afford to run MarkDuplicates or SortSam on preemptible instances: at five hours of runtime, the chances of getting preempted were too high. Since MarkDuplicatesSpark finishes in only 3-4 hours on a typical BAM, we can justify running it on preemptible instances, a significant cost reduction compared to Picard. MarkDuplicatesSpark is now certified to run on Spark for production workloads, and it scales very well even when run in local mode using the available cores of a single machine.
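
As an illustration, here is a minimal sketch of running MarkDuplicatesSpark in local mode on a 16-core machine; the file paths are placeholders, and Spark-specific arguments go after the isolated `--` separator (check `gatk MarkDuplicatesSpark --help` for the full set of options in your version):

```
gatk MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt \
    -- \
    --spark-master local[16]
```

Because the tool replaces both MarkDuplicates and SortSam, the output BAM is already coordinate-sorted and no separate sorting step is needed.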

Here is a graph of cost versus runtime for MarkDuplicatesSpark and the equivalent Picard tools.

Note that on some 4- and 8-core configurations, MarkDuplicatesSpark's cost is comparable to that of the combined MarkDuplicates + SortSam run (the dot on the right of the graph for non-preemptible machines), while its runtime is roughly a third.

For larger pipelines that require more computational resources, running on a cluster is best. ReadsPipelineSpark, a pure-Spark workflow combining MarkDuplicates, BQSR, and HaplotypeCaller, can take advantage of a Spark cluster for greatly improved runtimes. Using a 20-node Google Dataproc cluster, we were able to speed up this pipeline on a 30x WGS input BAM from 94 minutes to 55 minutes. We achieved these gains in large part by using Spark's “auxiliary files” feature to load side inputs like VCF and reference files. Previously these files were broadcast to Spark workers and stored in memory; now we take advantage of Spark's built-in mechanism for efficiently distributing files across the cluster and then reading them from each worker's local disk. GATK code was already optimized for the typical access patterns of reading features from local files, so the new approach gives a significant speed boost. Scripts and tuning suggestions for running the pipeline on Google Dataproc are available in the GATK GitHub repository.
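
For example, the pipeline can be launched on a Dataproc cluster along these lines (a sketch based on the GATK tool documentation; the bucket and cluster names are placeholders):

```
gatk ReadsPipelineSpark \
    -I gs://my-gcs-bucket/aligned_reads.bam \
    -R gs://my-gcs-bucket/reference.fasta \
    --known-sites gs://my-gcs-bucket/sites_of_variation.vcf \
    -O gs://my-gcs-bucket/output.vcf \
    -- \
    --spark-runner GCS \
    --cluster my-dataproc-cluster
```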

We have devoted a lot of effort to ensuring that the output of the newest versions of two of our core tools, MarkDuplicatesSpark and HaplotypeCallerSpark, matches their non-Spark equivalents, despite the boundary effects that come into play because Spark processes data in parallel shards. The new Spark tool RevertSamSpark replaces the Picard tool RevertSam, and MarkDuplicatesSpark now produces output identical to the Picard version. In releases prior to 4.1, HaplotypeCallerSpark output was not always close to the non-Spark version, so in 4.1 we have added a “strict” mode that matches the regular HaplotypeCaller exactly, at the expense of runtime. With “strict” mode disabled, HaplotypeCallerSpark still produces nearly identical output, with only small differences due to sharding boundary effects.
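
For exact concordance with the regular HaplotypeCaller, strict mode can be enabled on the command line. A hedged sketch follows; we assume a boolean `--strict` argument here, so verify the exact name with `gatk HaplotypeCallerSpark --help` for your version:

```
gatk HaplotypeCallerSpark \
    -I input.bam \
    -R reference.fasta \
    -O output.vcf \
    --strict \
    -- \
    --spark-master local[8]
```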

Spark-native replacement for the Hadoop-BAM MapReduce library

Prior to 4.1, GATK relied on the Hadoop-BAM MapReduce library for reading and writing BAM files (and some other common formats like CRAM, SAM, and VCF). Over the last year, we have been working on a Spark-native replacement, Disq, that is more accurate and performs better, particularly when using cloud stores like Google Cloud Storage. Disq is the default in GATK 4.1, making all GATK Spark tools faster across the board.

The new Disq codebase also enables features that optimize parallel processing. On-the-fly indexing produces BAI files in parallel when writing BAM files from Spark tools, and support for the new Splitting BAM Index (SBI) format, which has been proposed as an addition to the SAM specification, facilitates reading BAM files in parallel.
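
To give a flavor of the library, here is a minimal Java sketch of reading and writing a BAM with Disq, based on the examples in the Disq README; the paths are placeholders, and the index-enabling write options are an assumption that may vary by Disq version:

```java
import htsjdk.samtools.SAMRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.disq_bio.disq.BaiWriteOption;
import org.disq_bio.disq.HtsjdkReadsRdd;
import org.disq_bio.disq.HtsjdkReadsRddStorage;
import org.disq_bio.disq.SbiWriteOption;

public class DisqExample {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[*]", "disq-example");

    // Read a BAM (from local disk, HDFS, or a cloud store such as gs://)
    // into a Spark RDD of reads.
    HtsjdkReadsRddStorage storage = HtsjdkReadsRddStorage.makeDefault(jsc);
    HtsjdkReadsRdd reads = storage.read("gs://my-bucket/input.bam");
    JavaRDD<SAMRecord> records = reads.getReads();
    System.out.println("Number of reads: " + records.count());

    // Write the reads back out; BAI and SBI indexes can be produced
    // on the fly while writing (option names assumed from the Disq API).
    storage.write(reads, "gs://my-bucket/output.bam",
        BaiWriteOption.ENABLE, SbiWriteOption.ENABLE);

    jsc.stop();
  }
}
```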

We constantly strive to create and improve methods and to make them accessible to the scientific community to drive biomedical research. So stay tuned for more new and improved GATK tools!


Tue 5 Feb 2019