You're trying to evaluate the support for a particular call, but the numbers in the DP (total depth) and AD (allele depth) fields aren't making any sense. For example, the sum of all the ADs doesn't match up to the DP, or even more baffling, the AD for an allele that was called is zero!
Many users have reported being confused by variant calls where there is apparently no evidence for the called allele. For example, sometimes a VCF may contain a variant call that looks like this:
2 151214 . G A 673.77 . AN=2;DP=10;FS=0.000;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:10:38:702,0,38
You can see in the FORMAT field that the AD values are 0 for both alleles, yet both the INFO and FORMAT fields report a DP of 10. Since the DP in the INFO field is unfiltered and the DP in the FORMAT field is filtered, the fact that they are equal tells you that none of the reads were removed by the engine's built-in read filters. And if you look at the "bamout", you see 10 reads covering the position! So why is the VCF reporting an AD value of 0?
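As a sanity check, you can pull the AD and DP apart yourself. The sketch below is plain Python that simply splits the record shown above according to the standard VCF column and FORMAT conventions (no VCF library involved):

```python
# Minimal sketch: parse the FORMAT and sample columns of the VCF record
# shown above, then compare the summed AD against the sample-level DP.
record = ("2\t151214\t.\tG\tA\t673.77\t.\t"
          "AN=2;DP=10;FS=0.000;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693\t"
          "GT:AD:DP:GQ:PL\t0/1:0,0:10:38:702,0,38")

fields = record.split("\t")
fmt_keys = fields[8].split(":")          # GT, AD, DP, GQ, PL
sample_vals = fields[9].split(":")
sample = dict(zip(fmt_keys, sample_vals))

ad = [int(x) for x in sample["AD"].split(",")]
dp = int(sample["DP"])
print(sum(ad), dp)  # 0 10 -- the ADs sum to 0 even though DP is 10
```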
This is not actually a bug -- the program is doing what we expect; this is an interpretation problem. The answer lies in uninformative reads.
We call a read "uninformative" when it passes the quality filters, but the likelihood of the most likely allele given the read is not significantly larger than the likelihood of the second most likely allele given the read. Specifically, the difference between the Phred-scaled likelihoods must be greater than 0.2 to be considered significant. In other words, the most likely allele must be approximately 60% more likely than the second most likely allele.
Let’s walk through an example to make this clearer. Let’s say we have 2 reads and 2 possible alleles at a site. All of the reads have passed HaplotypeCaller’s quality filters, and the likelihoods of the alleles given the reads are in the table below.
|Reads||Likelihood of A||Likelihood of T|
Note: Keep in mind that HaplotypeCaller marginalizes the likelihoods of the haplotypes given the reads to get the likelihoods of the alleles given the reads. The table above shows the likelihoods of the alleles given the reads. For additional details, please see the HaplotypeCaller method documentation.
Now, let’s convert the likelihoods into Phred-scaled likelihoods. To do this, we simply take the log (base 10) of the likelihoods.
|Reads||Phred-scaled likelihood of A||Phred-scaled likelihood of T|
Now, we want to determine whether read 1 is informative. To do this, we simply look at the Phred-scaled likelihoods of the most likely allele and the second most likely allele. The Phred-scaled likelihood of the most likely allele (A) is -6.4122. The Phred-scaled likelihood of the second most likely allele (T) is -6.4352. Taking the difference between the two likelihoods gives us 0.023. Because 0.023 is less than 0.2, read 1 is considered uninformative.
To determine whether read 2 is informative, we take -6.3011 - (-6.5463). This gives us 0.2452, which is greater than 0.2, so read 2 is considered informative.
How does a difference of 0.2 mean the most likely allele is ~60% more likely than the second most likely allele? Because the likelihoods are Phred-scaled, a difference of 0.2 corresponds to a likelihood ratio of 10^0.2 ≈ 1.585, which is approximately 60% greater.
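If you want to verify the arithmetic, here is the example worked out in a few lines of Python. The numbers come from the worked example above; the assignment of read 2's two values to the A and T columns is illustrative, since only the numbers themselves were given:

```python
# log10-scaled likelihoods from the worked example; read 2's column
# assignment (A vs T) is illustrative, only the values were stated above
read1 = {"A": -6.4122, "T": -6.4352}
read2 = {"A": -6.5463, "T": -6.3011}

def is_informative(likelihoods, threshold=0.2):
    """A read is informative when its best allele beats the runner-up
    by more than `threshold` on the log10 scale."""
    best, second = sorted(likelihoods.values(), reverse=True)[:2]
    return (best - second) > threshold

print(round(read1["A"] - read1["T"], 3))   # 0.023  -> below 0.2, uninformative
print(round(read2["T"] - read2["A"], 4))   # 0.2452 -> above 0.2, informative
print(is_informative(read1), is_informative(read2))  # False True
print(round(10 ** 0.2, 3))                 # 1.585, i.e. ~60% more likely
```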
So, now that we know the math behind determining which reads are informative, let’s look at how this affects the record output to the VCF. If a read is considered informative, it gets counted toward the AD and DP of the variant allele in the output record. If a read is considered uninformative, it is counted towards the DP, but not the AD. That way, the AD value reflects how many reads actually contributed support for a given allele at the site. We would not want to include uninformative reads in the AD value because we don’t have confidence in them.
Please note, however, that although an uninformative read is not reported in the AD, it is still used in calculations for genotyping. In the future we may add an annotation to indicate counts of reads that were considered informative vs. uninformative. Let us know in the comments if you think that would be helpful.
In most cases, you will have enough coverage at a site to disregard small numbers of uninformative reads. Unfortunately, sometimes uninformative reads are the only reads you have at a site. In this case, we report the potential variant allele, but keep the AD values 0. The uncertainty at the site will be reflected in the GQ and PL values.
When you run AnalyzeCovariates to analyze your BQSR outputs, you may encounter an error starting with this line:
org.broadinstitute.sting.utils.R.RScriptExecutorException: RScript exited with 1. Run with -l DEBUG for more info.
The main reason why this error often occurs is simple, and so is the solution. The script depends on some external R libraries, so if you don't have them installed, the script fails. To find out what libraries are necessary and how to install them, you can refer to this tutorial.
One other common issue is that the version of ggplot2 you have installed is very recent and is not compatible with the BQSR script. If so, download this Rscript file and use it to generate the plots manually according to the instructions below.
If you have already checked that you have all the necessary libraries installed, you'll need to run the script manually in order to find out what is wrong. To new users this can seem complicated, but it only takes a few simple steps. First, re-run the command that produced the error with two extra arguments:
-l DEBUG (that's a lowercase L, not an uppercase i, to be clear) and
-csv my-report.csv (you can call the .csv file anything; this is so the intermediate csv file will be saved).
The snippet below shows you the components of the R script command line that AnalyzeCovariates uses.
INFO  18:04:55,355 AnalyzeCovariates - Generating plots file 'RTest.pdf'
DEBUG 18:04:55,672 RecalUtils - R command line: Rscript (resource)org/broadinstitute/gatk/utils/recalibration/BQSR.R /Users/schandra/BQSR_Testing/RTest.csv /Users/schandra/BQSR_Testing/RTest.recal /Users/schandra/BQSR_Testing/RTest.pdf
DEBUG 18:04:55,687 RScriptExecutor - Executing:
DEBUG 18:04:55,688 RScriptExecutor -   Rscript
DEBUG 18:04:55,688 RScriptExecutor -   -e
DEBUG 18:04:55,688 RScriptExecutor -   tempLibDir = '/var/folders/j9/5qgr3mvj0590pd2yb9hwc15454pxz0/T/Rlib.2085451458391709180';source('/var/folders/j9/5qgr3mvj0590pd2yb9hwc15454pxz0/T/BQSR.761775214345441497.R');
DEBUG 18:04:55,689 RScriptExecutor -   /Users/schandra/BQSR_Testing/RTest.csv
DEBUG 18:04:55,689 RScriptExecutor -   /Users/schandra/BQSR_Testing/RTest.recal
DEBUG 18:04:55,689 RScriptExecutor -   /Users/schandra/BQSR_Testing/RTest.pdf
So, your full command line will be:
Rscript BQSR.R RTest.csv RTest.recal RTest.pdf
For new users, the easiest way to do this is to do it from within an IDE program like RStudio. Or, you can start up R at the command line and run it that way, whatever you are comfortable with.
This is not as common as the "wrong reference build" problem, but it still pops up every now and then: a collaborator gives you a BAM or VCF file that's derived from the correct reference, but for whatever reason the contigs are not sorted in the same order. The GATK can be particular about the contig ordering of BAM and VCF files, so it will fail with an error in this case.
So what do you do?
You run Picard's ReorderSam tool on your BAM file, using the reference genome dictionary as a template, like this:
java -jar picard.jar ReorderSam \
    I=original.bam \
    O=reordered.bam \
    R=reference.fasta \
    CREATE_INDEX=TRUE
Here, reference.fasta is your genome reference, which must be accompanied by a valid *.dict dictionary file. The CREATE_INDEX argument is optional but useful if you plan to use the resulting file directly with GATK (otherwise you'll need to run another tool to create an index).
Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad or not, depending on what you want). If contigs have the same name in the BAM and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool!
You run Picard's SortVcf tool on your VCF file, using the reference genome dictionary as a template, like this:
java -jar picard.jar SortVcf \
    I=original.vcf \
    O=sorted.vcf \
    SEQUENCE_DICTIONARY=reference.dict
reference.dict is the sequence dictionary of your genome reference.
Note that you may need to delete the index file that gets created automatically for your new VCF by the Picard tool. GATK will automatically regenerate an index file for your VCF.
In version 3.5, we added some beefed-up VCF sequence dictionary validation. Unfortunately, as a side effect of the additional checks, some users have experienced an error that starts with "ERROR MESSAGE: Lexicographically sorted human genome sequence detected in variant.", caused by a check being triggered when it shouldn't be. This will be fixed in the next release; in the meantime, -U ALLOW_SEQ_DICT_INCOMPATIBILITY can be used (with caution) to override the check.
These errors occur when the names or sizes of contigs don't match between input files. This is a classic problem that typically happens when you get some files from collaborators, you try to use them with your own data, and GATK fails with a big fat error saying that the contigs don't match.
The first thing you need to do is find out which files are mismatched, because that will affect how you can fix the problem. This information is included in the error message, as shown in the examples below. You'll notice that GATK always evaluates everything relative to the reference.
A very common case we see looks like this:
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR   contig reads = chrM / 16569
##### ERROR   contig reference = chrM / 16571.
##### ERROR   reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
##### ERROR   reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
First, the error tells us that the mismatch is between the file containing reads, i.e. our BAM file, and the reference:
Input files reads and reference have incompatible contigs
It further tells us that the contig length doesn't match for the chrM contig:
Found contigs with the same name but different lengths:
##### ERROR   contig reads = chrM / 16569
##### ERROR   contig reference = chrM / 16571.
This can be caused either by using the wrong genome build version entirely, or using a reference that was hacked from a build that's very close but not identical, like b37 vs hg19, as detailed a bit more below.
We sometimes also see cases where people are using a very different reference; this is especially the case for non-model organisms where there is not yet a widely-accepted standard genome reference build.
Note that the error message also lists the content of the sequence dictionaries that it found for each file, and we see that some contigs in our reference dictionary are not listed in the BAM dictionary, but that's not a problem. If it were the opposite, with extra contigs in the BAM (or VCF), then GATK wouldn't know what to do with the reads from these extra contigs and would error out (even if we try restricting analysis using -L) with something like this:
#### ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with.
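A quick way to spot this kind of mismatch before a long GATK run is to compare the @SQ lines of the two sequence dictionaries yourself. The helper below is a stdlib-only Python sketch, not a GATK tool; it assumes standard SAM-header @SQ lines such as those found in a .dict file (the two header strings are made-up examples mirroring the chrM mismatch above):

```python
def read_sq_lines(header_text):
    """Extract {contig_name: length} and contig order from SAM @SQ lines."""
    contigs, order = {}, []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        tags = dict(f.split(":", 1) for f in line.split("\t")[1:])
        contigs[tags["SN"]] = int(tags["LN"])
        order.append(tags["SN"])
    return contigs, order

# Toy headers: the reference has hg19's chrM (16571), the BAM has 16569
ref = "@SQ\tSN:chrM\tLN:16571\n@SQ\tSN:chr1\tLN:249250621"
bam = "@SQ\tSN:chr1\tLN:249250621\n@SQ\tSN:chrM\tLN:16569"

ref_c, ref_order = read_sq_lines(ref)
bam_c, bam_order = read_sq_lines(bam)

for name in set(ref_c) & set(bam_c):
    if ref_c[name] != bam_c[name]:
        print(f"length mismatch: {name} {bam_c[name]} vs {ref_c[name]}")
if ref_order != bam_order:
    print("contig order differs")
```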
If you can, simply switch to the correct reference. Note that file names may be misleading, as people will sometimes rename files willy-nilly. Sometimes you'll need to do some detective work to identify the correct reference if you inherited someone else's sequence data.
If that's not an option because you either can't find the correct reference or you absolutely MUST use a particular reference build, then you will need to redo the alignment altogether. Sadly there is no liftover procedure for reads. If you don't have access to the original unaligned sequence files, you can use Picard tools to revert your BAM file back to an unaligned state (either unaligned BAM or FASTQ depending on the workflow you wish to follow).
The b37 and hg19 human genome builds are very similar, and the canonical chromosomes (1 through 22, X and Y) only differ by their names (no prefix vs. chr prefix, respectively). If you only care about those, and don't give a flying fig about the decoys or the mitochondrial genome, you could just rename the contigs throughout your mismatching file and call it done, right?
Well... This can work if you do it carefully and cleanly -- but many things can go wrong during the editing process that can screw up your files even more, and it only applies to the canonical chromosomes. The mitochondrial contig is a slightly different length (see error above) in addition to having a different naming convention, and all the other contigs (decoys, herpes virus etc) don't have direct equivalents.
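If you do attempt the renaming for the canonical chromosomes only, the mapping itself is a mechanical text edit. The hypothetical helper below sketches the idea in Python and deliberately refuses anything outside 1-22/X/Y, for exactly the reasons given above (chrM has a different length, and the decoys have no equivalents):

```python
# Canonical b37 contig names: "1".."22", "X", "Y"
CANONICAL = [str(i) for i in range(1, 23)] + ["X", "Y"]

def b37_to_hg19_contig(name):
    """Rename a canonical b37 contig to its hg19 name by adding the
    'chr' prefix. Anything else (MT, decoys, ...) has no safe 1:1
    equivalent, so we refuse rather than silently corrupt the file."""
    if name not in CANONICAL:
        raise ValueError(f"no safe hg19 equivalent for contig {name!r}")
    return "chr" + name

print(b37_to_hg19_contig("1"))  # chr1
print(b37_to_hg19_contig("X"))  # chrX
```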
So only try that if you know what you're doing. YMMV.
ERROR MESSAGE: Input files known and reference have incompatible contigs: Found contigs with the same name but different lengths:
ERROR   contig known = chrM / 16569
ERROR   contig reference = chrM / 16571.
Yep, it's just like the error we had with the BAM file above. Looks like we're using the wrong genome build again and a contig length doesn't match. But this time the error tells us that the mismatch is between the file identified as known and the reference:
Input files known and reference have incompatible contigs
We know (trust me) that this is the output of a RealignerTargetCreator command, so the known file must be the VCF file provided through the known argument. Depending on the tool, the way the file is identified may vary, but the logic should be fairly obvious.
If you can, find a version of the VCF file that is derived from the right reference. If you're working with human data and the VCF in question is just a common resource like dbsnp, you're in luck -- we provide versions of dbsnp and similar resources derived from the major human reference builds in our resource bundle (see the FAQs for access details).
location: ftp.broadinstitute.org
username: gsapubftp-anonymous
If that's not an option, then you'll have to liftover the mismatching VCF to the reference you need to work with. The best tool for liftover is Picard's LiftoverVcf.
GATK used to include some liftover utilities (documented below for the record) but we no longer support them.
This procedure involves three steps:
We provide a script that performs those three steps for you, called liftOverVCF.pl, which is available in our public source repository under the 'perl' directory -- but you have to check out a version older than 3.4. Instructions for pulling down our source code from github are available here.
The example below shows how you would run the script:
./liftOverVCF.pl \
    -vcf calls.b36.vcf \                  # input vcf
    -chain b36ToHg19.broad.over.chain \   # chain file
    -out calls.hg19.vcf \                 # output vcf
    -gatk gatk_source \                   # path to source code
    -newRef Homo_sapiens_assembly19 \     # path to new reference base name (without extension)
    -oldRef human_b36_both \              # path to old reference prefix (without extension)
    -tmp /broad/shptmp                    # temp file location (defaults to /tmp)
We provide several chain files to liftover between the major human reference builds, also in our resource bundle (mentioned above), in the Liftover_Chain_Files directory. If you are working with non-human organisms, we can't help you -- but others may have chain files, so ask around in your field.
Note that if you're at the Broad, you can access chain files to liftover from b36/hg18 to hg19 on the
You get an error like this:
SAM/BAM/CRAM file <filename> appears to be using the wrong encoding for quality scores
The standard format for quality score encodings is that Q0 == ASCII 33 according to the SAM specification. However, in some datasets (including older Illumina data), encoding starts at ASCII 64. This is a problem because the GATK assumes that it can use the quality scores as they are. If they are in fact encoded using a different scale, our tools will make an incorrect estimation of the quality of your data, and your analysis results will be off.
To prevent this from happening, the GATK engine performs a sanity check of the quality score encodings (since version 2.3) and will abort the run with the error message shown above if they are not standard.
If this happens to you, you'll need to run again with the flag -fixMisencodedQuals. What will happen is that the engine will simply subtract 31 from every quality score as it is read in, and proceed with the corrected values. Output files will include the corrected scores where applicable.
In some cases the data contains a mix of encodings (which is likely to arise if you're passing in a lot of different files from different sources together), and the GATK can't automatically compensate for that. There is an argument you can use to override this check: --allow_potentially_misencoded_quality_scores; but you use it at your own risk. We strongly encourage you to check the encodings of your files rather than use this option.
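If you want to check the encoding yourself before choosing a flag, the quality strings usually give it away: characters below ASCII 59 can only occur in Phred+33 data, while Phred+64 qualities sit at ASCII 64 and above. The following Python sketch is a rough heuristic (not the GATK's actual detector) illustrating both the check and the 31-point shift that -fixMisencodedQuals applies:

```python
def guess_encoding(qual_string):
    """Crude heuristic: any character below ASCII 59 implies Phred+33;
    a minimum at or above ASCII 64 suggests Phred+64. Short or uniformly
    high-quality strings can remain ambiguous."""
    lo = min(ord(c) for c in qual_string)
    if lo < 59:
        return "Phred+33"
    if lo >= 64:
        return "probably Phred+64"
    return "ambiguous"

def fix_misencoded(qual_string):
    """What -fixMisencodedQuals effectively does: shift every score down
    by 31 so Phred+64 values land on the standard Phred+33 scale."""
    return "".join(chr(ord(c) - 31) for c in qual_string)

print(guess_encoding("IIIIHGF#"))  # Phred+33 ('#' is ASCII 35)
print(guess_encoding("hhhhgf``"))  # probably Phred+64
print(fix_misencoded("hhhh"))      # IIII (ASCII 104 -> 73)
```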
See the Dictionary entry on read groups.
As detailed in the FAQs about input requirements, GATK expects all read groups appearing in the read data to be specified in the file header, and will fail with an error if it does not find that information (whether there is no read group information in the file, or a subset of reads do not have read groups).
Typically you should add read group information when you perform the original alignments (with e.g. BWA, which has an option to do so). So what do you do if you forgot to do that, and you don't want to have to rerun BWA all over again?
You can use a Picard tool called AddOrReplaceReadGroups to add the missing information to your input file.
Here's an example:
# throws an error
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I reads_without_RG.bam \
    -o output.vcf

# fix the read groups
java -jar picard.jar AddOrReplaceReadGroups \
    I=reads_without_RG.bam \
    O=reads_with_RG.bam \
    SORT_ORDER=coordinate \
    RGID=foo \
    RGLB=bar \
    RGPL=illumina \
    RGSM=Sample1 \
    CREATE_INDEX=True

# runs without error
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I reads_with_RG.bam \
    -o output.vcf
Note that if you don't know what information to put in the read groups, you should ask whoever performed the sequencing or provided the BAM to give you the metadata you need.
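For a quick pre-flight check on a text SAM file, you can verify that every RG tag used by a record is actually declared in the header. The function below is a stdlib-only Python sketch (it reads SAM text, not BAM, and the sample records are made up for illustration):

```python
def check_read_groups(sam_text):
    """Return the set of RG IDs used by records but absent from the header."""
    header_ids, used_ids = set(), set()
    for line in sam_text.splitlines():
        if line.startswith("@RG"):
            tags = dict(f.split(":", 1) for f in line.split("\t")[1:])
            header_ids.add(tags["ID"])
        elif line and not line.startswith("@"):
            # optional tags start at column 12 of a SAM record
            for field in line.split("\t")[11:]:
                if field.startswith("RG:Z:"):
                    used_ids.add(field[5:])
    return used_ids - header_ids

sam = ("@HD\tVN:1.5\n"
       "@RG\tID:foo\tSM:Sample1\tLB:bar\tPL:ILLUMINA\n"
       "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tRG:Z:foo\n"
       "r2\t0\tchr1\t200\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tRG:Z:baz\n")
print(check_read_groups(sam))  # {'baz'} -- an undeclared read group
```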
You're trying to run a GATK or Picard tool that operates on a SAM or BAM file, and getting some cryptic error that doesn't clearly tell you what's wrong. Bits of the stack trace (the pile of lines in the output log that the program outputs when there is a problem) may contain something like Error Type Count or NullPointerException -- or maybe something else that doesn't mean anything to you.
The most frequent cause of these unexplained problems is not a bug in the program -- it's an invalid or malformed SAM/BAM file. This means that there is something wrong either with the content of the file (something important is missing) or with its format (something is written the wrong way). Invalid SAM/BAM files generally have one or more errors in the following sections: the header tags, the alignment fields, or the optional alignment tags. In addition, the SAM/BAM index file can be a source of errors as well.
The source of these errors is usually introduced by upstream processing tools, such as the genome mapper/aligner or any other data processing tools you may have applied before feeding the data to Picard or GATK.
We recommend the workflow included below for diagnosing problems with ValidateSamFile. This workflow will help you tackle the problem efficiently and set priorities for dealing with multiple errors (which often happens). We also outline typical solutions for common errors, but note that this is not meant to be an exhaustive list -- there are too many possible problems to tackle all of them in this document. To be clear, here we focus on diagnostics, not treatment.
In some cases, it may not be possible to fix some problems that are too severe, and you may need to redo the genome alignment/mapping from scratch! Consider running ValidateSamFile proactively at all key steps of your analysis pipeline to catch errors early!
First, run ValidateSamFile in SUMMARY mode in order to get a summary of everything that is missing or improperly formatted in your input file. We set MODE=SUMMARY explicitly because by default the tool would just emit details about the first 100 problems it finds and then quit. If you have some minor formatting issues that don't really matter but affect every read record, you won't get to see the more important problems that occur later in the file.
$ java -jar picard.jar ValidateSamFile \
    I=input.bam \
    MODE=SUMMARY
If this outputs No errors found, then your SAM/BAM file is completely valid. If you were running this purely as a preventative measure, then you're good to go and can proceed to the next step in your pipeline. If you were doing this to diagnose a problem, then you're back to square one -- but at least now you know it's not likely to be a SAM/BAM file format issue. One exception: some analysis tools require read group tags like SM that are not required by the format specification itself, so the input files will pass validation but the analysis tools will still error out. If that happens to you, check whether your files have SM tags in the @RG lines of their BAM header. That is the most common culprit.
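To check that SM-tag caveat yourself on a header you've dumped to text, a small stdlib-only Python helper (hypothetical, not part of Picard) will do:

```python
def rg_lines_missing_sm(header_text):
    """Return the IDs of @RG header lines that have no SM (sample) tag."""
    missing = []
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            tags = dict(f.split(":", 1) for f in line.split("\t")[1:])
            if "SM" not in tags:
                missing.append(tags.get("ID", "<no ID>"))
    return missing

# Made-up header: the second read group lacks an SM tag
hdr = ("@RG\tID:lane1\tSM:Sample1\tPL:ILLUMINA\n"
       "@RG\tID:nosample\tPL:ILLUMINA\n")
print(rg_lines_missing_sm(hdr))  # ['nosample'] -- passes format validation
                                 # but trips up tools that need SM
```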
However, if the command above outputs one or more of the 8 possible WARNING or 48 possible ERROR messages (see the tables at the end of this document), you must proceed to the next step in the diagnostic workflow.
When run in SUMMARY mode, ValidateSamFile outputs a table that differentiates between two levels of error -- ERROR proper and WARNING -- based on the severity of the problems they would cause in downstream analysis. All problems that fall in the ERROR category must be addressed in order to proceed with other Picard or GATK tools, while those that fall in the WARNING category may often be ignored for some, if not all, subsequent analyses.
This table, generated by ValidateSamFile from a real BAM file, indicates that this file has a total of 1 MISSING_READ_GROUP error, 4 MISMATCH_MATE_ALIGNMENT_START errors, 894,289 MATES_ARE_SAME_END errors, and so on. Moreover, this output also indicates that there are 54 RECORD_MISSING_READ_GROUP warnings and 33 warnings of another type.
Because ERRORs are more severe than WARNINGs, we focus on diagnosing and fixing them first. From the first step we only had a summary of errors, so now we generate a more detailed report with this command:
$ java -jar picard.jar ValidateSamFile \
    I=input.bam \
    IGNORE_WARNINGS=true \
    MODE=VERBOSE
Note that we invoked both the MODE=VERBOSE and the IGNORE_WARNINGS=true options. The former is technically not necessary, as VERBOSE is the tool's default mode, but we specify it here to make it clear that that's the behavior we want. It produces a complete list of every problematic record, as well as a more descriptive explanation for each type of ERROR than is given in SUMMARY mode. The IGNORE_WARNINGS option enables us to specifically examine only the records with ERRORs. When working with large files, this feature can be quite helpful, because there may be many records with WARNINGs that are not immediately important, and we don't want them flooding the log output.
|ValidateSamFile (VERBOSE)||Error Description|
|ERROR: Read groups is empty||Empty read group field for multiple records|
|ERROR: Record 1, Read name 20FUKAAXX100202:6:27:4968:125377||Mate alignment does not match alignment start of mate|
|ERROR: Record 3, Read name 20FUKAAXX100202:6:27:4986:125375||Both mates are marked as second of pair|
|ERROR: Record 6, Read name 20GAVAAXX100126:4:47:18102:194445||Read CIGAR M operator maps off end of reference|
|ERROR: Read name 30PPJAAXX090125:1:60:1109:517#0||Mate not found for paired read|
|ERROR: Record 402, Read name 20GAVAAXX100126:3:44:17022:23968||Mate unmapped flag does not match read unmapped flag of mate|
|ERROR: Record 12, Read name HWI-ST1041:151:C7BJEACXX:1:1101:1128:82805||Read length does not match quals length|
These ERRORs are all problems that we must address before using this BAM file as input for further analysis. Most ERRORs can typically be fixed using Picard tools to either correct the formatting or fill in missing information, although sometimes you may want to simply filter out malformed reads using Samtools. For example, MISSING_READ_GROUP errors can be solved by adding the read group information to your data using the AddOrReplaceReadGroups tool, and most mate pair information errors can be fixed with FixMateInformation.
Once you have attempted to fix the errors in your file, you should put your new SAM/BAM file through the first validation step in the workflow, running ValidateSamFile in SUMMARY mode again. We do this to evaluate whether our attempted fix has solved the original ERRORs, and/or any of the original WARNINGs, and/or introduced any new WARNINGs (sadly, this does happen).
If you still have ERRORs, you'll have to loop through this part of the workflow until no more ERRORs are detected.
If you have no more ERRORs, congratulations! It's time to look at the WARNINGs (assuming there are still some -- if not, you're off to the races).
To obtain more detailed information about the warnings, we invoke the following command:
$ java -jar picard.jar ValidateSamFile \
    I=input.bam \
    IGNORE=type \
    MODE=VERBOSE
At this point we often use the IGNORE option to tell the program to skip a specific type of WARNING that we consider less important, in order to focus on the rest. In some cases we may even decide not to address some WARNINGs at all because we know they are harmless (for example, MATE_NOT_FOUND warnings are expected when working with a small snippet of data). But in general we do strongly recommend that you address all of them to avoid any downstream complications, unless you're sure you know what you're doing.
|ValidateSamFile (VERBOSE)||Warning Description|
|WARNING: Read name H0164ALXX140820:2:1204:13829:66057||A record is missing a read group|
|WARNING: Record 1, Read name HARMONIA-H16:1253:0:7:1208:15900:108776||NM tag (nucleotide differences) is missing|
Here we see a read group-related WARNING which would probably be fixed when we fix the MISSING_READ_GROUP error we encountered earlier -- hence the prioritization strategy of tackling ERRORs first and WARNINGs second.
We also see a WARNING about missing NM tags. This is an alignment tag that is added by some but not all genome aligners, and it is not used by the downstream tools that we care about, so you may decide to ignore this warning by adding IGNORE=MISSING_TAG_NM from now on when you run ValidateSamFile on this file.
Once you have attempted to fix all the WARNINGs that you care about in your file, you put your new SAM/BAM file through the first validation step in the workflow again, running ValidateSamFile in SUMMARY mode. Again, we check that no new ERRORs have been introduced and that the only WARNINGs that remain are the ones we feel comfortable ignoring. If that's not the case, we run through the workflow again. If it's all good, we can proceed with our analysis.
We are currently in the process of updating the Picard website to include the following two tables, describing WARNING (Table I) and ERROR (Table II) cases. Until that's done, you can find them here.
|INVALID_DATE_STRING||Date string is not ISO-8601|
|INVALID_QUALITY_FORMAT||Quality encodings out of range; appear to be Solexa or Illumina when Phred expected. Avoid exception being thrown as a result of no qualities being read.|
|General Alignment Record Issues|
|ADJACENT_INDEL_IN_CIGAR||CIGAR string contains an insertion (I) followed by deletion (D), or vice versa|
|RECORD_MISSING_READ_GROUP||A SAMRecord is found with no read group id|
|Mate Pair Issues|
|PAIRED_READ_NOT_MARKED_AS_FIRST_OR_SECOND||Pair flag set but not marked as first or second of pair|
|Optional Alignment Tag Issues|
|MISSING_TAG_NM||The NM tag (nucleotide differences) is missing|
|E2_BASE_EQUALS_PRIMARY_BASE||Secondary base calls should not be the same as primary, unless one or the other is N|
|General File, Index or Sequence Dictionary Issues|
|BAM_FILE_MISSING_TERMINATOR_BLOCK||BAM appears to be healthy, but is an older file so doesn't have terminator block|
|DUPLICATE_PROGRAM_GROUP_ID||Same program group id appears more than once|
|DUPLICATE_READ_GROUP_ID||Same read group id appears more than once|
|HEADER_RECORD_MISSING_REQUIRED_TAG||Header tag missing in header line|
|HEADER_TAG_MULTIPLY_DEFINED||Header tag appears more than once in header line with different value|
|INVALID_PLATFORM_VALUE||The read group has an invalid value set for its PL field|
|INVALID_VERSION_NUMBER||Does not match any of the acceptable versions|
|MISSING_HEADER||The SAM/BAM file is missing the header|
|MISSING_PLATFORM_VALUE||The read group is missing its PL (platform) field|
|MISSING_READ_GROUP||The header is missing read group information|
|MISSING_SEQUENCE_DICTIONARY||There is no sequence dictionary in the header|
|MISSING_VERSION_NUMBER||Header has no version number|
|POORLY_FORMATTED_HEADER_TAG||Header tag does not have colon|
|READ_GROUP_NOT_FOUND||A read group ID on a SAMRecord is not found in the header|
|UNRECOGNIZED_HEADER_TYPE||Header record is not one of the standard types|
|General Alignment Record Issues|
|CIGAR_MAPS_OFF_REFERENCE||Bases corresponding to M operator in CIGAR extend beyond reference|
|INVALID_ALIGNMENT_START||Alignment start position is incorrect|
|INVALID_CIGAR||CIGAR string error for either read or mate|
|INVALID_FLAG_FIRST_OF_PAIR||First of pair flag set for unpaired read|
|INVALID_FLAG_SECOND_OF_PAIR||Second of pair flag set for unpaired read|
|INVALID_FLAG_PROPER_PAIR||Proper pair flag set for unpaired read|
|INVALID_FLAG_MATE_NEG_STRAND||Mate negative strand flag set for unpaired read|
|INVALID_FLAG_NOT_PRIM_ALIGNMENT||Not primary alignment flag set for unmapped read|
|INVALID_FLAG_SUPPLEMENTARY_ALIGNMENT||Supplementary alignment flag set for unmapped read|
|INVALID_FLAG_READ_UNMAPPED||Mapped read flag not set for mapped read|
|INVALID_INSERT_SIZE||Inferred insert size is out of range|
|INVALID_MAPPING_QUALITY||Mapping quality set for unmapped read or is >= 256|
|INVALID_PREDICTED_MEDIAN_INSERT_SIZE||PI tag value is not numeric|
|MISMATCH_READ_LENGTH_AND_QUALS_LENGTH||Length of sequence string and length of base quality string do not match|
|TAG_VALUE_TOO_LARGE||Unsigned integer tag value is deprecated in BAM|
|Mate Pair Issues|
|INVALID_FLAG_MATE_UNMAPPED||Mate unmapped flag is incorrectly set|
|MATE_NOT_FOUND||Read is marked as paired, but its pair was not found|
|MATE_CIGAR_STRING_INVALID_PRESENCE||A mate CIGAR string is present for a read whose mate is NOT mapped|
|MATE_FIELD_MISMATCH||Read alignment fields do not match its mate|
|MATES_ARE_SAME_END||Both mates of a pair are marked either as first or second mates|
|MISMATCH_FLAG_MATE_UNMAPPED||Mate unmapped flag does not match read unmapped flag of mate|
|MISMATCH_FLAG_MATE_NEG_STRAND||Mate negative strand flag does not match read strand flag|
|MISMATCH_MATE_ALIGNMENT_START||Mate alignment start does not match alignment start of mate|
|MISMATCH_MATE_CIGAR_STRING||The mate cigar tag does not match its mate's cigar string|
|MISMATCH_MATE_REF_INDEX||Mate reference index (MRNM) does not match reference index of mate|
|Optional Alignment Tag Issues|
|INVALID_MATE_REF_INDEX||Mate reference index (MRNM) set for unpaired read|
|INVALID_TAG_NM||The NM tag (nucleotide differences) is incorrect|
|MISMATCH_READ_LENGTH_AND_E2_LENGTH||Lengths of secondary base calls tag values and read should match|
|MISMATCH_READ_LENGTH_AND_U2_LENGTH||Secondary base quals tag values should match read length|
|EMPTY_READ||Indicates that a read corresponding to the first strand has a length of zero and/or lacks flow signal intensities (FZ)|
|INVALID_INDEXING_BIN||Indexing bin set on SAMRecord does not agree with computed value|
|General File, Index or Sequence Dictionary Issues|
|INVALID_INDEX_FILE_POINTER||Invalid virtualFilePointer in index|
|INVALID_REFERENCE_INDEX||Reference index not found in sequence dictionary|
|RECORD_OUT_OF_ORDER||The record is out of order|
|TRUNCATED_FILE||BAM file does not have terminator block|
Our preferred method for filtering variants after the calling step is to use VQSR, a.k.a. recalibration. However, it requires well-curated training/truth resources, which are typically not available for organisms other than humans, and it also requires a large number of variant sites to operate properly, so it is not suitable for some small-scale experiments such as targeted gene panels or exome studies with fewer than 30 exomes. For the latter, it is sometimes possible to pad your cohort with exomes from another study (especially for humans -- use 1000 Genomes or ExAC!) but again for non-human organisms it is often not possible to do this.
So, if this is your case and you are sure that you cannot use VQSR, then you will need to use the VariantFiltration tool to hard-filter your variants. To do this, you will need to compose filter expressions using JEXL, as explained here, based on the generic filter recommendations detailed below. There is a tutorial that shows how to achieve this step by step. Be sure to also read the documentation explaining how to understand and improve upon the generic hard filtering recommendations.
Let's be painfully clear about this: there is no magic formula that will give you perfect results. Filtering variants manually, using thresholds on annotation values, is subject to all sorts of caveats. The appropriateness of both the annotations and the threshold values is very highly dependent on the specific callset, how it was called, what the data was like, what organism it belongs to, etc.
HOWEVER, because we want to help and people always say that something is better than nothing (not necessarily true, but let's go with that for now), we have formulated some generic recommendations that should at least provide a starting point for people to experiment with their data.
In case you didn't catch that bit in bold there, we're saying that you absolutely SHOULD NOT expect to run these commands and be done with your analysis. You absolutely SHOULD expect to have to evaluate your results critically and TRY AGAIN with some parameter adjustments until you find the settings that are right for your data.
In addition, please note that these recommendations are mainly designed for dealing with very small data sets (in terms of both the number of samples and the size of the targeted regions). If you are not using VQSR because you do not have training/truth resources available for your organism, then you should expect to have to do even more tweaking of the filtering parameters.
Here are some recommended arguments to use with VariantFiltration when ALL other options are unavailable to you. Be sure to read the documentation explaining how to understand and improve upon these recommendations.
Note that these JEXL expressions will tag as filtered any sites where the annotation value matches the expression. So if you use the expression QD < 2.0, any site with a QD lower than 2 will be tagged as failing that filter.
QD < 2.0
MQ < 40.0
FS > 60.0
SOR > 3.0
MQRankSum < -12.5
ReadPosRankSum < -8.0
If your callset was generated with UnifiedGenotyper for legacy reasons, you can add HaplotypeScore > 13.0.
QD < 2.0
ReadPosRankSum < -20.0
InbreedingCoeff < -0.8
FS > 200.0
SOR > 10.0
The InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.
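To make the semantics of these expressions concrete, here is a minimal Python sketch of how hard filtering tags a site using the SNP expressions above. The filter names are illustrative (you choose them yourself with VariantFiltration's --filterName argument), and this is not a substitute for running the real tool; in particular, check the VariantFiltration documentation for how it treats sites with missing annotations.

```python
# Minimal sketch of VariantFiltration-style hard filtering for SNPs.
# A site FAILS a filter when the expression matches (e.g. QD < 2.0).
# Sites missing an annotation are simply not evaluated against that
# filter in this toy version.

SNP_FILTERS = {
    "QD2":              ("QD", lambda v: v < 2.0),
    "MQ40":             ("MQ", lambda v: v < 40.0),
    "FS60":             ("FS", lambda v: v > 60.0),
    "SOR3":             ("SOR", lambda v: v > 3.0),
    "MQRankSum-12.5":   ("MQRankSum", lambda v: v < -12.5),
    "ReadPosRankSum-8": ("ReadPosRankSum", lambda v: v < -8.0),
}

def apply_filters(annotations, filters=SNP_FILTERS):
    """Return the FILTER column for a site: 'PASS' or the
    semicolon-joined names of the filters it failed."""
    failed = [name for name, (key, matches) in filters.items()
              if key in annotations and matches(annotations[key])]
    return ";".join(failed) if failed else "PASS"

print(apply_filters({"QD": 1.5, "MQ": 55.0}))              # QD2
print(apply_filters({"QD": 12.0, "MQ": 60.0, "FS": 1.0}))  # PASS
```

Note that failing one filter does not exempt a site from the others: a site can accumulate several filter names in its FILTER column.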
For shallow-coverage data (<10x), it is virtually impossible to use manual filtering to reliably separate true positives from false positives. You really, really, really should use the protocol involving variant quality score recalibration. If you can't do that, maybe you need to take a long hard look at your experimental design. In any case you're probably in for a world of pain.
Some bits of this article may seem harsh, or depressing. Sorry. We believe in giving you the cold hard truth.
HOWEVER, we do understand that this is one of the major points of pain that GATK users encounter -- along with understanding how VQSR works, so really, whichever option you go with, you're going to suffer.
And we do genuinely want to help. So although we can't look at every single person's callset and give an opinion on how it looks (no, seriously, don't ask us to do that), we do want to hear from you about how we can best help you help yourself. What information do you feel would help you make informed decisions about how to set parameters? Are the meanings of the annotations not clear? Would knowing more about how they are computed help you understand how you can use them? Do you want more math? Less math, more concrete examples?
Tell us what you'd like to see here, and we'll do our best to make it happen. (no unicorns though, we're out of stock)
We also welcome testimonials from you. We are one small team; you are a legion of analysts all trying different things. Please feel free to come forward and share your findings on what works particularly well in your hands.
You specified -A <some annotation> in a command line invoking one of the annotation-capable tools (HaplotypeCaller, MuTect2, UnifiedGenotyper and VariantAnnotator), but that annotation did not show up in your output VCF.
Keep in mind that all annotations that are necessary to run our Best Practices are annotated by default, so you should generally not need to request annotations unless you're doing something a bit special.
There can be several reasons why this happens, depending on the tool, the annotation, and your data. These are the four we see most often; if you encounter another that is not listed here, let us know in the comments.
For example, you're running MuTect2 but requested an annotation that is specific to HaplotypeCaller. There should be an error message to that effect in the output log. It's not possible to override this; but if you believe the annotation should be available to the tool, let us know in the forum and we'll consider putting in a feature request.
For example, you're running HaplotypeCaller and you want InbreedingCoefficient, but you didn't specify a pedigree file. There should be an error message to that effect in the output log. The solution is simply to provide the missing input file. Another example: you're running VariantAnnotator and you want to annotate Coverage, but you didn't specify a BAM file. The tool needs to see the read data in order to calculate the annotation, so again, you simply need to provide the BAM file.
For example, you're looking at RankSumTest annotations, which require heterozygous sites in order to perform the necessary calculations, but you're running on haploid data so you don't have any het sites. There is no workaround; the annotation is not applicable to your data. Another example: you requested InbreedingCoefficient, but your population includes fewer than 10 founder samples, which are required for the annotation calculation. There is no workaround; the annotation is not applicable to your data.
For example, you requested Coverage from HaplotypeCaller, which already annotates this by default. There is currently a bug that causes some default annotations to be dropped from the list if specified on the command line. This will be addressed in an upcoming version. For now, the workaround is to check what annotations are applied by default and NOT request them with -A.
This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.
There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.
In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.
If you are using HaplotypeCaller to call your variants (as you nearly always should) you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.
What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.
In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present, however, after reassembly, we find there is really a large deletion!
The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.
Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.
The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!
Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.
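As a toy illustration of the difference between the reported DP and the evidence the caller can actually use, here is a small sketch. The base quality threshold below is illustrative only, not the callers' exact default; check the tool documentation for the real value.

```python
# Sketch: unfiltered depth (what the DP field reports) vs. the depth
# actually usable by the caller. Bases below the base quality threshold
# and reads with mapping quality 255 ("unknown") are not counted as
# supporting evidence.

MIN_BASE_QUALITY = 10  # illustrative value, not the exact default

def usable_depth(pileup):
    """pileup: list of (base_quality, mapping_quality) pairs,
    one per read overlapping the site."""
    return sum(1 for bq, mq in pileup
               if bq >= MIN_BASE_QUALITY and mq != 255)

pileup = [(30, 60), (8, 60), (25, 255), (15, 40)]
print(len(pileup))           # unfiltered depth (DP): 4
print(usable_depth(pileup))  # depth the caller can actually use: 2
```

Here a site that "looks" 4x covered in the BAM provides only 2 usable pieces of evidence: one base fails the quality threshold and one read has an unknown mapping quality.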
By default the variant callers will only consider a certain number of alternate alleles. This parameter can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page to find out what the default value for this argument is). Note however that genotyping sites with many alternate alleles increases the computational cost of the processing, scaling exponentially with the number of alternate alleles, which means it will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.
The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what the default value for this argument is), but be aware that increasing its value could adversely affect the reliability of your results.
Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to have a reference bias, which can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the site, you are probably seeing false positives.
This is highly experimental, but if all else fails, worth a shot (with HaplotypeCaller and MuTect2).
In some difficult sequence contexts (e.g. repeat regions), when some default-sized kmers are non-unique, cycles get generated in the graph. By default the program increases the kmer size automatically to try again, but after several attempts it will eventually quit trying and fail to call the expected variant (typically because the variant gets pruned out of the read-threading assembly graph, and is therefore never assembled into a candidate haplotype). We've seen cases where it's still possible to force a resolution using -allowNonUniqueKmersInRef and/or increasing the --kmerSize (or the range of permitted sizes: 10, 25, 35 for example).
Decreasing the value of -minDanglingBranchLength (i.e. attempting to recover shorter dangling branches in the graph) can recover variants, at the risk of taking on more false positives.
We sometimes need to be able to use multiple versions of Java on the same computer to run command-line tools that have different version requirements. At the time of writing, GATK requires an older version of Java (1.7), whereas Picard requires the most recent version (1.8). So we present a reasonably painless solution for running both Picard tools and GATK tools on your computer.
You will need to have both versions of Java installed on your machine. The Java installation package for 1.8 can be found here, and the package for 1.7 is here. Note that we point to the “JDK” (Java Development Kit) packages because they are the most complete Java packages (suitable for developing in Java as well as running Java executables), and we have had reports that the “JRE” (Java Runtime Environment) equivalents were not sufficient to run GATK on some machines.
First, check your current default Java version by opening your terminal and typing java -version. If the version starts with “1.8”, you will need to add the following code to the beginning of your GATK command to specify that it should be run using version 1.7.
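The snippet itself does not survive in this copy of the article. On macOS (where the “JavaVirtualMachines” folder mentioned below lives), one way to do this is to launch java through the /usr/libexec/java_home utility; this is a sketch, and the jar name/path is illustrative:

```shell
# Sketch for macOS: run GATK under Java 1.7 even when 1.8 is the default.
# The jar path is illustrative; point it at your own GATK jar.
/usr/libexec/java_home -v 1.7 --exec java -jar GenomeAnalysisTK.jar --help
```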
If your default version starts with “1.7”, then you will need to prepend the code below to your Picard command:
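Again, the snippet itself does not survive in this copy; on macOS, a sketch along the same lines (with an illustrative jar path) would be:

```shell
# Sketch for macOS: run Picard under Java 1.8 even when 1.7 is the default.
# The jar path is illustrative; point it at your own Picard jar.
/usr/libexec/java_home -v 1.8 --exec java -jar picard.jar SortSam --help
```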
You may need to change the version-specific part of each code snippet to refer to the specific version of Java you have installed on your machine (version and update). To find that, simply navigate to the folder where you installed the JDK. Under the “JavaVirtualMachines” folder, you should find JDK folders named for the specific version and update.