Okay, we realize the name's a bit of a mouthful, and we're willing to tweak it if anyone has any good ideas. But never mind that. It's difficult to overstate the importance of this new approach to joint variant discovery (but I'll give it a shot), so we're really stoked to finally be able to share the details of how it's going to work in practice.

You're probably going to be surprised at how simple it is in practice (not that it was particularly easy to actually build, mind you). The gory details are in the new document here, but here's an overview of how it looks within the Best Practices workflow you all know and (hopefully) love:


The first surprise is that instead of calling variants on multiple samples, you now just run HaplotypeCaller on each sample individually. "Oh no," I hear you cry, "but the results were so much better when I called multiple samples together!". Well yeah, but it took forever. Bear with me for a minute.

The key here is that you run HaplotypeCaller in gVCF mode. This outputs a so-called genomic VCF, which contains a record of the genotype likelihoods and annotations for every single site in the genome (or exome), whether or not there is evidence of variation. This essentially boils down all the useful information that can be gleaned from the BAM files, and makes it unnecessary to refer back to the BAM in later steps.

So you repeat that for all your samples (which goes reasonably fast since per-sample calling is pretty tractable nowadays). Optionally, you can add in a step to combine gVCF files if you're working on a really large cohort. Then in the next step, you just run a separate genotyping tool on all the gVCFs (or combined gVCFs) together, which gives you the same output (raw SNPs and indel calls) that you would have got from one-step multisample calling.
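In command-line terms, the three steps above look roughly like the following. This is only a sketch: the file names and reference are placeholders, and exact arguments (the `--variant_index_*` pair in particular) may change between versions, so check the tool docs for your release.

```shell
# Step 1: call each sample individually in gVCF mode
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta -I sample1.bam \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR --variant_index_parameter 128000 \
    -o sample1.g.vcf

# Step 2 (optional, for large cohorts): merge per-sample gVCFs into batches
java -jar GenomeAnalysisTK.jar -T CombineGVCFs \
    -R reference.fasta \
    --variant sample1.g.vcf --variant sample2.g.vcf \
    -o batch1.g.vcf

# Step 3: joint genotyping across all the (combined) gVCFs
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R reference.fasta \
    --variant batch1.g.vcf --variant batch2.g.vcf \
    -o joint_calls.vcf
```

The key point is that step 1 is the only step that touches the BAMs, so adding a new sample to the cohort only requires running HaplotypeCaller on that one sample and re-running the (cheap) joint genotyping step.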

See, that's the beauty of the new workflow. A lot less work (for the computer) for equivalent results. And the ability to process samples incrementally and perform joint discovery on cohort sizes that would have previously got you hauled off to the funny farm.

Let us know what you think!



mgymrek


The link to the "gory details" just goes to this page. Where can I find these gory details?

Thu 6 Mar 2014

Geraldine_VdAuwera


Link fixed.

Thu 6 Mar 2014

Bettina_Harr


Hi Geraldine, I have started this program on my bam file containing all chromosomes (mouse). It managed to call genotypes for 3 chromosomes in 5 days. I cannot run programs for longer than that due to limits in wall time. Do you recommend running each chromosome separately, maybe? Or is there any other way to speed up the calculations? Thanks and best, Bettina

Thu 6 Mar 2014

mike_boursnell


Hi Geraldine, This new method seems terribly slow. I am running a batch of BAM files of about 2.5GB each, over only a 14MB region of the genome, and each one is taking about 12 hours. I am using -nct 1 because I can't risk using higher values, as that was unreliable in the past. Am I doing something wrong?

java -Xmx4g -Djava.io.tmpdir=javatempdir -jar GenomeAnalysisTK.jar -R canfam3.fasta -T HaplotypeCaller --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -L chr13:38900000-53300000 -I AS_case_07_final.bam -o gVCF1_AS_case_07_final_variants.vcf -S STRICT -nct 1

Thu 6 Mar 2014

Geraldine_VdAuwera


@Bettina_Harr‌ and @mike_boursnell‌ That's odd, as in our hands this goes much faster than the previous versions. Are you working with very deep coverage? You can certainly parallelize by running on each chromosome individually, to help cut down on runtimes.

Thu 6 Mar 2014

Bettina_Harr


Hi Geraldine, no, my coverage is not extremely high, average 20-fold per site.

Thu 6 Mar 2014

mike_boursnell


Quite high coverage, 300-400x. What is the 128000 parameter?

Thu 6 Mar 2014

Geraldine_VdAuwera


The 128000 parameter has to do with the index compression scheme -- I think it specifies block size. Let me call in some of the devs who have been working on the HC's performance. Are you running 3.0 or 3.1?

Thu 6 Mar 2014

Bettina_Harr


Hi Geraldine, ok, I have restarted the program processing one chromosome at a time. Looking at my intermediate output, I noticed that the software issued a warning:

WARN 02:33:07,773 DiploidExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 1:65884356 has 8 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

I cannot see how a single diploid individual could have 6 or even more alleles. Is this maybe due to some simple sequence repeats/indels? What is the rationale behind allowing 6 different alleles at a site? Thanks!

Thu 6 Mar 2014

Bettina_Harr


Oh, I am running 2.8.1. I did not know there was already a new version. Could this be part of the reason? I have done all my pre-processing now with 2.8.1. Can I switch to 3.1 with the HaplotypeCaller? Or should I stay with 2.8.1? Thanks!

Thu 6 Mar 2014

Geraldine_VdAuwera


> I cannot see how a single diploid individual could have 6 or even more alleles. Is this maybe due to some simple sequence repeats/indels? What is the rationale behind allowing 6 different alleles at a site?

Yes, this is typically due to indels in a context with homopolymers and/or repeats. You can limit the number of max alt alleles if you want, which will make the program run faster.

> I am running 2.8.1.

You should definitely switch to 3.1; this new method is specifically designed for work in 3.1, and involves new code that is not in 2.8.

Thu 6 Mar 2014

Bettina_Harr


ok, but I do not need to repeat the pre-processing, i.e. indel realignment and base recalibration, when switching to 3.1 now, correct?

Thu 6 Mar 2014

Carneiro


Correct, @Bettina_Harr‌, you don't need to reprocess (BQSR & indel realignment) to switch. Just run the HaplotypeCaller. I actually didn't even know that we made the reference model available in 2.8; it was *very* experimental at that point. Try 3.1 and let us know how it is.

Thu 6 Mar 2014

Geraldine_VdAuwera


@mike_boursnell To clarify, are you finding that running HC in GVCF mode is much slower than previously?

Thu 6 Mar 2014

mike_boursnell


Yes. That's right.

Thu 6 Mar 2014

Geraldine_VdAuwera


@mike_boursnell‌ Do you have any comparison data, e.g. runtimes for running two different commands on the same data? We'd like to get a sense of what the difference is. The HC does do quite a bit more work in GVCF mode, so it's expected that the upfront runtime will be a little more, but not orders of magnitude.

Thu 6 Mar 2014

mike_boursnell


OK, I'll have a look. I'm not sure how to compensate for other things running on our server at the same time. Any suggestions?

Thu 6 Mar 2014

Geraldine_VdAuwera


Not really -- if you were able to reserve a set of nodes that would be ideal of course. Not knowing your setup/access I can't really advise. We have dedicated machines for testing this kind of thing so it's a lot easier.

Thu 6 Mar 2014

sergiomv


Hi Geraldine, I ran GATK 3.1 UnifiedGenotyper (UG) and HaplotypeCaller (HC) on whole genome NA12878. With UG I have my vcf file with all variants called (nearly 4 million), but when I run HC I obtained very few variants (nearly 1.3 million), and in the HC log I see this warning (a lot of them):

WARN 05:26:17,926 ExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr1:7458214 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

My BWA alignment is OK; I see only two alternate alleles (diploid). Do you know why this warning appears? Thanks.

Thu 6 Mar 2014

Geraldine_VdAuwera


Hi @sergiomv‌, First I would say that the different numbers of variants are most probably unrelated to the warnings you're seeing. I can't really comment on those numbers; you should try to examine what the overlap is and what the major differences between the two callsets are. Have a look at the distribution of annotation values. See if you are missing major blocks of variants, or if there is a correlation with annotation values, e.g. if the variants HC is missing all have low MQ in the UG callset. HaplotypeCaller is more stringent about things like MQ than UG.

Regarding the alleles question, I would bet that HC is seeing multiple possible deletion alleles in your data, because of the low complexity of the context. Keep in mind that the HC does a de novo reassembly step with the data, so it may come to a different alignment than BWA. And since considering multiple alleles decreases performance, the HC is set to only proceed with the most likely alleles and discard the rest. You don't need to worry about this.

Thu 6 Mar 2014

Bettina_Harr


Hi Geraldine, a couple of questions: 1) I ran the HaplotypeCaller and at the end I am getting a short summary of my reads:

INFO 05:16:33,184 ProgressMeter - Total runtime 30549.58 secs, 509.16 min, 8.49 hours
INFO 05:16:33,185 MicroScheduler - 5481436 reads were filtered out during the traversal out of approximately 21788872 total reads (25.16%)
INFO 05:16:33,186 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO 05:16:33,186 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 05:16:33,187 MicroScheduler - -> 5434226 reads (24.94% of total) failing HCMappingQualityFilter
INFO 05:16:33,187 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 05:16:33,188 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 05:16:33,188 MicroScheduler - -> 47210 reads (0.22% of total) failing NotPrimaryAlignmentFilter
INFO 05:16:33,188 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

I am wondering about those 25% of reads that were "filtered" during traversal. Could you tell me if I have to be concerned about this, and what that actually means?

Thu 6 Mar 2014

Bettina_Harr


2) I am still struggling with hitting the wall time when I try to run HaplotypeCaller on the "large" mouse chromosomes, i.e. chr1. It takes more than 4 days. We have a large computing cluster; I tried to start Java with 16 processors and 20g RAM, but the program would not run any faster. I also set the max alternate alleles to 2 hoping it would save some time, but it does not. I attach my jobscript below. Is there anything else I could try to speed things up? I am now using version 3.1.1. Thanks in advance!

#!/bin/sh
#BSUB -q fat-long
#BSUB -R scratch
#BSUB -W 120:00
#BSUB -o TP7-10F1A2.recal.bamX_out
#BSUB -e TP7-10F1A2.recal.bamX_error
#BSUB -R 'span[hosts=1]'
#BSUB -n 16
java -Xmx20g -jar /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -T HaplotypeCaller -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I TP7-10F1A2.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF -L X --variant_index_type LINEAR --variant_index_parameter 128000 -o TP7-10F1A2.recal.bamX.gVCF

Thu 6 Mar 2014

Geraldine_VdAuwera


Hi @Bettina_Harr‌, 1) HaplotypeCaller applies more stringent quality filters than our other tools. Any reads that have a mapping quality less than 20 (if I remember correctly) will be ignored. In your case it looks like many reads have a mapping quality that's considered too low to be useful, unfortunately. This depends mainly on the aligner you used and the quality of your data. 2) You can try using a more aggressive pruning setting (see [here](http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html#--minPruning)) and/or adding a layer of multithreading if your cluster allows it (see [here](http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_CommandLineGATK.html#-nct)). In addition, if this is exome data, you can provide a list of capture targets using the -L argument.
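Concretely, those two suggestions would look something like this (a sketch only; file names are placeholders, and good values for --minPruning and -nct depend entirely on your data and hardware):

```shell
# Higher --minPruning discards more low-support paths from the assembly
# graph (faster, at some cost in sensitivity); -nct adds CPU threads,
# which usually also means raising -Xmx accordingly.
java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta -I sample.bam \
    --emitRefConfidence GVCF \
    --minPruning 3 -nct 4 \
    -o sample.g.vcf
```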

Thu 6 Mar 2014

Bettina_Harr


Thank you Geraldine. One more short question: does it make sense to set anything for the -Xmx parameter?

Thu 6 Mar 2014

Bettina_Harr


Hi Geraldine, fyi, the -nt option does not seem to be supported in the HaplotypeCaller. I am testing the -nct option now.

INFO 01:45:01,617 HelpFormatter - --------------------------------------------------------------------------------
INFO 01:45:01,620 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21
INFO 01:45:01,621 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 01:45:01,621 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 01:45:01,626 HelpFormatter - Program Args: -T HaplotypeCaller -nt 2 -nct 4 -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I TP7-10F1A2.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF -L X --variant_index_type LINEAR --variant_index_parameter 128000 -o TP7-10F1A2.recal.bamX.gVCF
INFO 01:45:01,630 HelpFormatter - Executing as bharr@gwda036 on Linux 2.6.32-431.1.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18.
INFO 01:45:01,631 HelpFormatter - Date/Time: 2014/04/23 01:45:01
INFO 01:45:01,631 HelpFormatter - --------------------------------------------------------------------------------
INFO 01:45:01,632 HelpFormatter - --------------------------------------------------------------------------------
INFO 01:45:02,319 GenomeAnalysisEngine - Strictness is SILENT
INFO 01:45:02,461 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 01:45:02,471 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 01:45:02,522 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.05
INFO 01:45:02,595 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 01:45:02,604 IntervalUtils - Processing 171031299 bp from intervals
INFO 01:45:02,617 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 4 CPU thread(s) for each of 2 data thread(s), of 64 processors available on this machine
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis HaplotypeCaller currently does not support parallel execution with nt. Please run your analysis without the nt option.
##### ERROR ------------------------------------------------------------------------------------------

Thu 6 Mar 2014

Geraldine_VdAuwera


@Bettina_Harr‌ If you're using -nct (which is what I meant to recommend, sorry if that wasn't clear), you may want to increase memory allocation, yes. See the FAQs on multithreading for some general guidelines.

Thu 6 Mar 2014

Bettina_Harr


Unfortunately, I am not very experienced with parallelism. I ran the code overnight WITHOUT a special -Xmx parameter (I did not know what reasonable value I should use in multi-processing mode). It runs fine for about 5h and then exits with exit code 1. The error file contains this at the end. Do you know what could be the problem? Did it simply run out of memory? I used -nct 6 and reserved 10 processors in my jobscript, to be on the safe side.

INFO 01:42:29,369 ProgressMeter - 1:85086078 0.00e+00 7.4 h 15250.3 w 43.5% 17.0 h 9.6 h
INFO 01:43:29,371 ProgressMeter - 1:85087145 0.00e+00 7.4 h 15250.3 w 43.5% 17.0 h 9.6 h
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.NullPointerException
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeDiploidHaplotypeLikelihoods(PairHMMLikelihoodCalculationEngine.java:443)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeDiploidHaplotypeLikelihoods(PairHMMLikelihoodCalculationEngine.java:417)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.calculateGLsForThisEvent(GenotypingEngine.java:385)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.assignGenotypeLikelihoods(GenotypingEngine.java:222)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:880)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------

Thu 6 Mar 2014

Geraldine_VdAuwera


This looks like it might be a bug, actually. If you can isolate the region in which this issue occurred and produce a test case, we'll have a look and try to fix it. Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

Thu 6 Mar 2014

Bettina_Harr


Hi Geraldine, ok, I performed the following test: the program generated the error after position 85299925 on chromosome 1 (see error below). Therefore I re-ran it with the -L option (-L 1:85298925-85309925):

java -jar /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -T HaplotypeCaller -L 1:85298925-85309925 -nct 6 -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I 16B.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o 16B.recal.bam1.gVCF1

This generated no error and a normal-looking gVCF file. I think it is not a bug in itself, as the error only arises with chromosome 1 files, and not with any other chromosome of the same individual mouse. I have now rerun the chr1 files with the -Xmx option set to 20g, reserving 20 processors for it (each processor has 4Gb of memory allocated to it). The run is still waiting in the queue, so I will only be able to say what happened tomorrow morning. When I start the program interactively and then use "top" to monitor the memory, I can see it uses 18.6 Gb. Maybe -Xmx20g is still not enough?

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23772 bharr 20 0 18.6g 842m 11m S 781.1 1.3 0:16.51 java

error file:

INFO 22:32:03,103 ProgressMeter - 1:85296437 0.00e+00 4.2 h 15250.3 w 43.6% 9.7 h 5.5 h
INFO 22:32:33,105 ProgressMeter - 1:85299925 0.00e+00 4.2 h 15250.3 w 43.6% 9.7 h 5.5 h
WARN 22:32:33,427 ExactAFCalc - this tool is currently set to genotype at most 2 alternate alleles in a given context, but the context at 1:85287439 has 4 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.util.ConcurrentModificationException
at java.util.LinkedHashMap$LinkedHashIterator.nextEntry(Unknown Source)
at java.util.LinkedHashMap$EntryIterator.next(Unknown Source)
at java.util.LinkedHashMap$EntryIterator.next(Unknown Source)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.addMiscellaneousAllele(GenotypingEngine.java:257)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.assignGenotypeLikelihoods(GenotypingEngine.java:227)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:880)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

Thu 6 Mar 2014

Geraldine_VdAuwera


Ack, this is worse than I thought. "ConcurrentModificationException" means this is most likely a problem with multithreading; it seems several threads may be trying to modify the same information at the same time, which is bad and causes the crash you see. This might be triggered by something in your data. At this point I would recommend not using multithreading at all with HaplotypeCaller. Instead of multithreading you may want to run HaplotypeCaller individually per chromosome, then concatenate the resulting variants (e.g. using the CatVariants utility). This will enable you to still use your cluster for parallelism without running into this thread safety issue.
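A sketch of that scatter-gather pattern for a mouse genome follows. The file names are hypothetical, and the CatVariants class path shown matches the GATK 3.x "sting" packages visible in the stack traces in this thread, but check the docs for your exact version:

```shell
# Scatter: one HaplotypeCaller job per chromosome
# (in practice, submit each iteration as a separate cluster job)
for CHR in $(seq 1 19) X Y; do
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R reference.fasta -I sample.bam \
        --emitRefConfidence GVCF -L $CHR \
        -o sample.chr${CHR}.g.vcf
done

# Gather: concatenate the per-chromosome gVCFs in reference order
java -cp GenomeAnalysisTK.jar org.broadinstitute.sting.tools.CatVariants \
    -R reference.fasta \
    -V sample.chr1.g.vcf -V sample.chr2.g.vcf \
    -out sample.g.vcf
```

Note that CatVariants is invoked with `-cp` rather than `-jar` because it is a standalone utility, not a GATK walker.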

Thu 6 Mar 2014

Kurt


I've been wondering about that. I've been using both -nct and -pairHMM VECTOR blah when running in gvcf mode, and I have been experiencing random crashes that aren't reproducible (I run it again and it goes through fine) on human exomes. About 1 or 2 crashes per 100 runs so far. My stack trace is not the same as above... I haven't had time to look at all of them, but below is my latest crash that I've already rerun through... I'm actually going to do a sanity check through my last 150 to see if there were any that crashed that I missed.

##### ERROR stack trace
java.lang.NullPointerException
at net.sf.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:51)
at net.sf.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:41)
at java.util.TimSort.binarySort(TimSort.java:265)
at java.util.TimSort.sort(TimSort.java:208)
at java.util.TimSort.sort(TimSort.java:173)
at java.util.Arrays.sort(Arrays.java:659)
at java.util.Collections.sort(Collections.java:217)
at org.broadinstitute.sting.utils.sam.ReadUtils.sortReadsByCoordinate(ReadUtils.java:320)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.finalizeActiveRegion(HaplotypeCaller.java:1111)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.referenceModelForNoVariation(HaplotypeCaller.java:1001)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:808)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)

Thu 6 Mar 2014

Bettina_Harr


Hi Geraldine, I am already running everything on a per-chromosome basis, and this is where I hit my runtime limits with chromosome 1. It takes more than four days, even when allowing 20g when I start Java. It seems that maybe I have to subdivide chromosome 1 into chunks to be able to do it :-(!

Thu 6 Mar 2014

Geraldine_VdAuwera


Wait - you're saying it's taking four days per chromosome per sample? What kind of coverage do you have?

Thu 6 Mar 2014

Bettina_Harr


medium coverage ~20 fold.

Thu 6 Mar 2014

Bettina_Harr


These are my CPU times (in seconds):

TP7-10F1A2.recal.bamMT_out: CPU time : 73
TP7-10F1A2.recal.bamX_out: CPU time : 26442
TP7-10F1A2.recal.bamY_out: CPU time : 42362
TP7-10F1A2.recal.bam19_out: CPU time : 45158
TP7-10F1A2.recal.bam15_out: CPU time : 70149
TP7-10F1A2.recal.bam18_out: CPU time : 91582
TP7-10F1A2.recal.bam16_out: CPU time : 113535
TP7-10F1A2.recal.bam14_out: CPU time : 116572
TP7-10F1A2.recal.bam12_out: CPU time : 123313
TP7-10F1A2.recal.bam17_out: CPU time : 132176
TP7-10F1A2.recal.bam3_out: CPU time : 133359
TP7-10F1A2.recal.bam7_out: CPU time : 143940
TP7-10F1A2.recal.bam8_out: CPU time : 145699
TP7-10F1A2.recal.bam6_out: CPU time : 148685
TP7-10F1A2.recal.bam11_out: CPU time : 150079
TP7-10F1A2.recal.bam9_out: CPU time : 153481
TP7-10F1A2.recal.bam10_out: CPU time : 160582
TP7-10F1A2.recal.bam13_out: CPU time : 163721
TP7-10F1A2.recal.bam5_out: CPU time : 196669
TP7-10F1A2.recal.bam2_out: CPU time : 253768
TP7-10F1A2.recal.bam4_out: CPU time : 258922
TP7-10F1A2.recal.bam1_out: CPU time : 355144

Thu 6 Mar 2014

Geraldine_VdAuwera


That seems awfully slow. Unfortunately I can't really help you deal with performance and platform issues as that is beyond the scope of support I can provide, sorry.

Thu 6 Mar 2014

Bettina_Harr


Hi Geraldine, ok, I got it to work. With this configuration, the whole of chromosome 1 takes a little more than 1 day. So it is not a bug after all when using the -nct option; one just seems to need tons of memory.

#!/bin/sh
#BSUB -q fat-long
#BSUB -R scratch
#BSUB -W 120:00
#BSUB -o 14.recal.bam1_out
#BSUB -e 14.recal.bam1_error
#BSUB -R 'span[hosts=1]'
#BSUB -n 20
java -jar -Xmx40g /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 6 -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I 14.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF -L 1 --variant_index_type LINEAR --variant_index_parameter 128000 -o 14.recal.bam1.gVCF

Thu 6 Mar 2014

mike_boursnell


Hi. Can we use this gVCF method to analyse a bunch of BAM files all from the same sample (just so they can be analysed in parallel)? If so, how would you have to set up the Read Group information in these BAM files?

Thu 6 Mar 2014

Geraldine_VdAuwera


@mike_boursnell I'm not sure what you mean, but if you're saying you want to compare the BAMs as if they belonged to different samples (e.g. to compare results on technical replicates), then all you need to do is give them different SM tags so that GATK treats them as separate samples when you do the GenotypeGVCFs step.

Thu 6 Mar 2014

mike_boursnell


This is for whole genome sequencing of a single sample. We are looking at splitting the single FASTQ file into (say) 10 smaller bits and then processing them in parallel. When it gets to the HaplotypeCaller gVCFs stage there will be 10 separate BAM files, but which are all actually the same sample. If we give them the same SM tag will GATK assume they are the same sample and produce a single-column VCF file?

Thu 6 Mar 2014

Geraldine_VdAuwera


@mike_boursnell‌ Oh OK, got it. Yes, you should be able to just pass in the separate bam files. As long as they have the same SM tag, the data will get re-aggregated by the engine and served appropriately to the HC, leading to a single-sample VCF file.
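One way to check and, if necessary, fix the SM tags in the sharded BAMs (using samtools and Picard's AddOrReplaceReadGroups; the file and read-group names here are made up for illustration):

```shell
# Inspect the sample name (SM:) in each shard's read groups
samtools view -H shard1.bam | grep '^@RG'

# Give every shard the same SM tag but a distinct read-group ID
java -jar AddOrReplaceReadGroups.jar \
    I=shard1.bam O=shard1.rg.bam \
    RGID=shard1 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=my_sample
```

The important detail is that RGSM is identical across shards (so the engine aggregates them into one sample) while RGID stays unique per shard.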

Thu 6 Mar 2014

Bettina_Harr


Hi Geraldine, I have now completed the variant calling using HaplotypeCaller and recalibrated my variants. It all worked, but I am now in a position where my output files contain ONLY variable sites, and not the invariable ones. The VCF files that GenotypeGVCFs produces no longer contain the invariable sites. I would like to have the info on all sites in the final VCF file, even for those sites that are not variable, as I'd like to see their coverage. Is this possible at all at this point? Thanks in advance!

Thu 6 Mar 2014

Geraldine_VdAuwera


Hi @Bettina_Harr‌, Yes, you can get that using the `-inv` argument with GenotypeGVCFs. However, in the current version this will cause no-call site records to be emitted, so the hom-ref sites won't have much information attached to them. I believe the next version will produce detailed records at all sites.

Thu 6 Mar 2014

Bettina_Harr


Thank you Geraldine! I have another question. The R script does not automatically run on my machine, so I was trying to run the script manually by typing:

library(ggplot2)
> source("7.recalibrate_SNP_plots.R")
Error: 'opts' is deprecated. Use 'theme' instead. (Defunct; last used in version 0.9.1)

I am getting the error above. I am using ggplot2 1.0.0 and R version 3.1.0 (2014-04-10) -- "Spring Dance", Platform: x86_64-apple-darwin13.1.0 (64-bit). Simply replacing "opts" with "theme" does not do the trick; I then get another error:

Error: theme_rect is deprecated. Use 'element_rect' instead. (Defunct; last used in version 0.9.1)

Thanks! Bettina

Thu 6 Mar 2014

Geraldine_VdAuwera


Hi @Bettina_Harr‌, There are a few other functions related to the opts -> theme change that also need to be adapted; the error message tells you which. Just step through the script and change every one that throws an error in the same way.

Thu 6 Mar 2014




