
CheckFingerprint (Picard)

Checks the sample identity of the sequence/genotype data in the provided file (SAM/BAM or VCF) against a set of known genotypes in the supplied genotype file (in VCF format).

Summary

Computes a fingerprint (essentially, genotype information from different parts of the genome) from the supplied input file (SAM/BAM or VCF) and compares it to the expected fingerprint genotypes provided. The key output is a LOD score, which represents the relative likelihood of the sequence data originating from the same sample as the genotypes vs. from a random sample.
Two outputs are produced:
  1. A summary metrics file that gives metrics of the fingerprint matches when comparing the input to a set of genotypes for the expected sample. Metrics are reported at the single-sample level if the input was a VCF, or at the read-group level (lane or index within a lane) if the input was a SAM/BAM.
  2. A detail metrics file that contains the individual SNP/haplotype comparisons within a fingerprint comparison.
The metrics files fill the fields of the classes FingerprintingSummaryMetrics and FingerprintingDetailMetrics. The output files may be specified individually using the SUMMARY_OUTPUT and DETAIL_OUTPUT options. Alternatively, the OUTPUT option may be used to give the base of the two output files, with the summary metrics having the file extension ".fingerprinting_summary_metrics" and the detail metrics having the file extension ".fingerprinting_detail_metrics".
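
As a small illustration of the naming rule above, the following shell sketch derives the two file names that would result from a given OUTPUT base. The prefix value and the variable names are placeholders chosen for this sketch, not something the tool defines.

```shell
# Base prefix, as it would be passed to OUTPUT (placeholder value).
prefix="sample_fingerprinting"

# Each metrics file is the prefix followed by a fixed extension.
summary_file="${prefix}.fingerprinting_summary_metrics"
detail_file="${prefix}.fingerprinting_detail_metrics"

printf '%s\n' "$summary_file" "$detail_file"
```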

Example comparing a bam against known genotypes:

     java -jar picard.jar CheckFingerprint \
          INPUT=sample.bam \
          GENOTYPES=sample_genotypes.vcf \
          HAPLOTYPE_MAP=fingerprinting_haplotype_database.txt \
          OUTPUT=sample_fingerprinting

Detailed Explanation

This tool calculates a single number that reports the LOD score for the identity check between the INPUT and the GENOTYPES. A positive value indicates that the data appear to come from the same individual; in other words, the identity checks out. The scale is logarithmic (base 10), so a LOD of 6 indicates that it is 1,000,000 times more likely that the data match the genotypes than not. A negative value indicates that the data do not match. A score near zero is inconclusive and can result from low coverage or non-informative genotypes.

The identity check uses the haplotype blocks defined in the HAPLOTYPE_MAP file to gain statistical power for detecting identity or a swap by aggregating data from several SNPs in each haplotype block. This enables an identity check of samples with very low coverage (e.g. ~1x mean coverage).

When provided a VCF, the identity check looks at the PL, GL and GT fields (in that order) and uses the first one that it finds.
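
Two points from this explanation can be made concrete with a short shell sketch: converting a base-10 LOD score back into a likelihood ratio, and the PL, GL, GT lookup order. Both function names are hypothetical helpers written for this illustration. The second treats "finds" as "is listed in the VCF FORMAT string", which is a simplification and not necessarily how the tool inspects each record.

```shell
# Convert a LOD score (log10 of a likelihood ratio) into the ratio itself.
lod_to_ratio() {
  awk -v lod="$1" 'BEGIN { printf "%g\n", 10 ^ lod }'
}

# Given a VCF FORMAT string such as "GT:AD:PL", print the first of
# PL, GL, GT that is listed, following the documented lookup order.
pick_genotype_field() {
  for field in PL GL GT; do
    case ":$1:" in
      *":$field:"*) printf '%s\n' "$field"; return 0 ;;
    esac
  done
  return 1  # none of the three fields is listed
}

lod_to_ratio 2                        # ratio for a LOD of 2
lod_to_ratio -3                       # ratio for a LOD of -3
pick_genotype_field "GT:AD:DP:GQ:PL"  # PL takes precedence over GT
```

A ratio above 1 favors a match and a ratio below 1 favors a mismatch. No cutoff for "near zero" is given here, so the sketch does not invent one.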

Category Diagnostics and Quality Control


CheckFingerprint (Picard) specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the Argument details list below the table.

Argument name(s) Default value Summary
Required Arguments
--DETAIL_OUTPUT
 -D
null The text file to which to write detail metrics.
--GENOTYPES
 -G
null File of genotypes (VCF) to be used in comparison. May contain any number of genotypes; CheckFingerprint will use only those that are usable for fingerprinting.
--HAPLOTYPE_MAP
 -H
null The file lists a set of SNPs, optionally arranged in high-LD blocks, to be used for fingerprinting. See https://software.broadinstitute.org/gatk/documentation/article?id=9526 for details.
--INPUT
 -I
null Input file SAM/BAM or VCF. If a VCF is used, it must have at least one sample. If there is more than one sample in the VCF, the parameter OBSERVED_SAMPLE_ALIAS must be provided to indicate which sample's data to use. If there are no samples in the VCF, an exception will be thrown.
--OUTPUT
 -O
null The base prefix of output files to write. The summary metrics will have the file extension 'fingerprinting_summary_metrics' and the detail metrics will have the extension 'fingerprinting_detail_metrics'.
--SUMMARY_OUTPUT
 -S
null The text file to which to write summary metrics.
Optional Tool Arguments
--arguments_file
[] read one or more arguments files and add them to the command line
--EXPECTED_SAMPLE_ALIAS
 -SAMPLE_ALIAS
null This parameter can be used to specify which sample's genotypes to use from the expected VCF file (the GENOTYPES file). If it is not supplied, the sample name from the input (VCF or BAM read group header) will be used.
--GENOTYPE_LOD_THRESHOLD
 -LOD
5.0 When counting haplotypes checked and matching, count only haplotypes where the most likely haplotype achieves at least this LOD.
--help
 -h
false display the help message
--IGNORE_READ_GROUPS
 -IGNORE_RG
false If the input is a SAM/BAM, and this parameter is true, treat the entire input BAM as one single read group in the calculation, ignoring RG annotations, and producing a single fingerprint metric for the entire BAM.
--OBSERVED_SAMPLE_ALIAS
null If the input is a VCF, this parameter is used to select which sample's data in the VCF to use.
--version
false display the version number for this tool
Optional Common Arguments
--COMPRESSION_LEVEL
5 Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX
false Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE
false Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS
client_secrets.json Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM
500000 When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET
false Whether to suppress job-summary info on System.err.
--REFERENCE_SEQUENCE
 -R
null Reference sequence file.
--TMP_DIR
[] One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER
 -use_jdk_deflater
false Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER
 -use_jdk_inflater
false Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY
STRICT Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY
INFO Control verbosity of logging.
Advanced Arguments
--showHidden
false display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--arguments_file / NA

read one or more arguments files and add them to the command line

List[File]  []


--COMPRESSION_LEVEL / NA

Compression level for all compressed files created (e.g. BAM and VCF).

int  5  [ [ -∞  ∞ ] ]


--CREATE_INDEX / NA

Whether to create a BAM index when writing a coordinate-sorted BAM file.

Boolean  false


--CREATE_MD5_FILE / NA

Whether to create an MD5 digest for any BAM or FASTQ files created.

boolean  false


--DETAIL_OUTPUT / -D

The text file to which to write detail metrics.

Exclusion: This argument cannot be used at the same time as OUTPUT.

R File  null


--EXPECTED_SAMPLE_ALIAS / -SAMPLE_ALIAS

This parameter can be used to specify which sample's genotypes to use from the expected VCF file (the GENOTYPES file). If it is not supplied, the sample name from the input (VCF or BAM read group header) will be used.

String  null


--GA4GH_CLIENT_SECRETS / NA

Google Genomics API client_secrets.json file path.

String  client_secrets.json


--GENOTYPE_LOD_THRESHOLD / -LOD

When counting haplotypes checked and matching, count only haplotypes where the most likely haplotype achieves at least this LOD.

double  5.0  [ [ -∞  ∞ ] ]


--GENOTYPES / -G

File of genotypes (VCF) to be used in comparison. May contain any number of genotypes; CheckFingerprint will use only those that are usable for fingerprinting.

R String  null


--HAPLOTYPE_MAP / -H

The file lists a set of SNPs, optionally arranged in high-LD blocks, to be used for fingerprinting. See https://software.broadinstitute.org/gatk/documentation/article?id=9526 for details.

R File  null


--help / -h

display the help message

boolean  false


--IGNORE_READ_GROUPS / -IGNORE_RG

If the input is a SAM/BAM, and this parameter is true, treat the entire input BAM as one single read group in the calculation, ignoring RG annotations, and producing a single fingerprint metric for the entire BAM.

boolean  false


--INPUT / -I

Input file SAM/BAM or VCF. If a VCF is used, it must have at least one sample. If there is more than one sample in the VCF, the parameter OBSERVED_SAMPLE_ALIAS must be provided to indicate which sample's data to use. If there are no samples in the VCF, an exception will be thrown.

R String  null
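
Before running the tool on a VCF, it can help to check how many samples the file holds, since that decides whether OBSERVED_SAMPLE_ALIAS is needed. The sketch below counts sample columns on the "#CHROM" header line of an uncompressed VCF. The function name, the temporary file and the two sample names are placeholders invented for this illustration.

```shell
# Count sample columns in an uncompressed VCF: on the "#CHROM" header line the
# first nine tab-separated columns are fixed and each later column is one sample.
count_vcf_samples() {
  awk -F '\t' '/^#CHROM/ { n = NF - 9; print (n > 0 ? n : 0); exit }' "$1"
}

# Tiny stand-in header written to a temporary file for demonstration only.
demo_vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB\n' >> "$demo_vcf"

n_samples=$(count_vcf_samples "$demo_vcf")
if [ "$n_samples" -gt 1 ]; then
  echo "VCF has $n_samples samples: OBSERVED_SAMPLE_ALIAS is needed to pick one"
fi
rm -f "$demo_vcf"
```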


--MAX_RECORDS_IN_RAM / NA

When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.

Integer  500000  [ [ -∞  ∞ ] ]


--OBSERVED_SAMPLE_ALIAS / NA

If the input is a VCF, this parameter is used to select which sample's data in the VCF to use.

String  null


--OUTPUT / -O

The base prefix of output files to write. The summary metrics will have the file extension 'fingerprinting_summary_metrics' and the detail metrics will have the extension 'fingerprinting_detail_metrics'.

Exclusion: This argument cannot be used at the same time as SUMMARY_OUTPUT, DETAIL_OUTPUT, S, D.

R String  null
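
The exclusion rules for the three output options can be checked in a wrapper script before launching the tool. The following is a minimal sketch with a hypothetical helper name. It assumes that when OUTPUT is absent both SUMMARY_OUTPUT and DETAIL_OUTPUT must be supplied, which is inferred from all three being listed as required rather than stated outright.

```shell
# Return 0 if the combination of output options looks usable, 1 otherwise.
# Arguments: OUTPUT prefix, SUMMARY_OUTPUT path, DETAIL_OUTPUT path
# (pass an empty string for an option that is not set).
check_output_args() {
  output=$1; summary=$2; detail=$3
  if [ -n "$output" ]; then
    # OUTPUT cannot be combined with either individual option.
    [ -z "$summary" ] && [ -z "$detail" ]
  else
    # Assumption: without OUTPUT, both individual files must be named.
    [ -n "$summary" ] && [ -n "$detail" ]
  fi
}

if check_output_args "sample_fingerprinting" "" ""; then
  echo "output options are consistent"
fi
```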


--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean  false


--REFERENCE_SEQUENCE / -R

Reference sequence file.

File  null


--showHidden / -showHidden

display hidden arguments

boolean  false


--SUMMARY_OUTPUT / -S

The text file to which to write summary metrics.

Exclusion: This argument cannot be used at the same time as OUTPUT.

R File  null
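
Once the summary metrics file is written, the LOD values can be pulled out with standard text tools. The sketch below assumes the usual Picard metrics layout (comment lines starting with "#", then a tab-separated header row followed by data rows) and a column named LOD_EXPECTED_SAMPLE. Neither is spelled out on this page, so check both against a real output file. The helper name and the mock file contents are invented for the demonstration.

```shell
# Print one column of a Picard-style metrics table, selected by column name.
# Lines starting with "#" and empty lines are skipped; the first remaining
# line is treated as the tab-separated header row.
metrics_column() {
  awk -F '\t' -v col="$2" '
    /^#/ || NF == 0 { next }
    !header_seen {
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      header_seen = 1
      next
    }
    idx { print $idx }
  ' "$1"
}

# Mock metrics file with made-up values, for demonstration only.
demo_metrics=$(mktemp)
printf '## METRICS CLASS\t(mock)\n' > "$demo_metrics"
printf 'READ_GROUP\tSAMPLE\tLOD_EXPECTED_SAMPLE\n' >> "$demo_metrics"
printf 'rg1\tsample1\t12.5\n' >> "$demo_metrics"
printf 'rg2\tsample1\t-3.2\n' >> "$demo_metrics"

lods=$(metrics_column "$demo_metrics" LOD_EXPECTED_SAMPLE)
printf '%s\n' "$lods"
rm -f "$demo_metrics"
```

For a SAM/BAM input this yields one value per read group, matching the level at which the summary metrics are reported.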


--TMP_DIR / NA

One or more directories with space available to be used by this program for temporary storage of working files

List[File]  []


--USE_JDK_DEFLATER / -use_jdk_deflater

Use the JDK Deflater instead of the Intel Deflater for writing compressed output

Boolean  false


--USE_JDK_INFLATER / -use_jdk_inflater

Use the JDK Inflater instead of the Intel Inflater for reading compressed input

Boolean  false


--VALIDATION_STRINGENCY / NA

Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency  STRICT


--VERBOSITY / NA

Control verbosity of logging.

The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version / NA

display the version number for this tool

boolean  false


GATK version 4.1.1.0 built at Wed, 3 Apr 2019 09:19:24 -0400.