CrosscheckFingerprints (Picard)

Checks that all data in the set of input files appear to come from the same individual. Can be used to cross-check readgroups, libraries, samples, or files. Operates on bams/sams and vcfs (including gvcfs).

Category Diagnostics and Quality Control


Overview

Checks that all data in the set of input files appear to come from the same individual. Can be used to compare according to readgroups, libraries, samples, or files. Operates on bams/sams and vcfs (including gvcfs).

Summary

Checks if all the genetic data within a set of files appear to come from the same individual. It quickly determines whether a "group's" genotype matches that of an input SAM/BAM/VCF by selective sampling, and has been designed to work well even for low-depth SAM/BAMs.
The tool collects "fingerprints" (essentially genotype information from different parts of the genome) at the finest level available in the data (readgroup for SAM files and sample for VCF files) and then optionally aggregates it by library, sample or file, to increase power and provide results at the desired resolution. Output is in a "Moltenized" format, one row per comparison. The results are emitted into a metric file for the class CrosscheckMetric. In this format the output includes the LOD score and also a tumor-aware LOD score, which can help assess identity even in the presence of a severe loss of heterozygosity with high purity (a situation that could otherwise cause the tool to fail to notice that samples are from the same individual). A matrix output is also available to facilitate visual inspection of crosscheck results.
Since there can be many rows of output in the metric file, we recommend the use of ClusterCrosscheckMetrics as a follow-up step to running CrosscheckFingerprints.
There are cases where one would like to identify a few groups out of a collection of many possible groups (say, to link a BAM to its correct sample in a multi-sample VCF). In this case one would not care about cross-checking the various samples in the VCF against each other, but only about checking the identity of the BAM against the various samples in the VCF. The SECOND_INPUT argument is provided for this use-case. With SECOND_INPUT provided, CrosscheckFingerprints does the following:
  • aggregation of data happens independently for the input files in INPUT and SECOND_INPUT.
  • aggregation of data happens at the SAMPLE level.
  • each sample from INPUT will only be compared to that same sample in SECOND_INPUT.
  • MATRIX_OUTPUT is disabled.
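
The SECOND_INPUT use-case described above could be invoked roughly as follows. This is only a sketch: the file names, the `link_bam_to_vcf` wrapper and the `RUN` variable are hypothetical, and `CROSSCHECK_MODE=CHECK_ALL_OTHERS` is chosen on the assumption that it compares each INPUT sample against every sample in SECOND_INPUT (this page lists the value but does not describe it). By default the wrapper only prints the command so it can be reviewed first.

```shell
# Prints the command by default; set RUN= (empty) in the environment to execute it.
run=${RUN-echo}

link_bam_to_vcf() {
    # $1: BAM, $2: multi-sample VCF, $3: haplotype map, $4: metrics output file
    $run java -jar picard.jar CrosscheckFingerprints \
        INPUT="$1" \
        SECOND_INPUT="$2" \
        HAPLOTYPE_MAP="$3" \
        CROSSCHECK_MODE=CHECK_ALL_OTHERS \
        LOD_THRESHOLD=-5 \
        OUTPUT="$4"
}

# Hypothetical file names.
link_bam_to_vcf sample.bam cohort.vcf \
    fingerprinting_haplotype_database.txt sample.vs.cohort.crosscheck_metrics
```

MATRIX_OUTPUT is deliberately left out, since it cannot be combined with SECOND_INPUT.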

Examples

    Check that all the readgroups from a sample match each other:

         java -jar picard.jar CrosscheckFingerprints \
              INPUT=sample.with.many.readgroups.bam \
              HAPLOTYPE_MAP=fingerprinting_haplotype_database.txt \
              LOD_THRESHOLD=-5 \
              OUTPUT=sample.crosscheck_metrics
     

    Check that all the readgroups match as expected when providing reads from two samples from the same individual:

         java -jar picard.jar CrosscheckFingerprints \
              INPUT=sample.one.with.many.readgroups.bam \
              INPUT=sample.two.with.many.readgroups.bam \
              HAPLOTYPE_MAP=fingerprinting_haplotype_database.txt \
              LOD_THRESHOLD=-5 \
              EXPECT_ALL_GROUPS_TO_MATCH=true \
              OUTPUT=sample.crosscheck_metrics
     

    Detailed Explanation

    This tool calculates the LOD score for an identity check between "groups" of data in the INPUT files, as defined by the CROSSCHECK_BY argument. A positive value indicates that the data seem to have come from the same individual or, in other words, that the identity checks out. The scale is logarithmic (base 10), so a LOD of 6 indicates that it is 1,000,000 times more likely that the data match the genotypes than not. A negative value indicates that the data do not match. A score that is near zero is inconclusive and can result from low coverage or non-informative genotypes. Each group is assigned a sample identifier (for SAM this is taken from the SM tag in the appropriate readgroup header line; for VCF this is taken from the column label in the file header). After combining all the data from the same "group" together, an all-against-all comparison is performed. Results are categorized as one of the FingerprintResult enum values EXPECTED_MATCH, EXPECTED_MISMATCH, UNEXPECTED_MATCH, UNEXPECTED_MISMATCH, or AMBIGUOUS, depending on the LOD score and on whether the sample identifiers of the groups agree: LOD scores that are less than LOD_THRESHOLD are considered mismatches, and those greater than -LOD_THRESHOLD are matches (anything in between is ambiguous). If the sample identifiers are equal, the groups are expected to match. They are expected to mismatch otherwise.
    The identity check makes use of haplotype blocks defined in the HAPLOTYPE_MAP file to enable it to have higher statistical power for detecting identity or swap by aggregating data from several SNPs in the haplotype block. This enables an identity check of samples with very low coverage (e.g. ~1x mean coverage).
    When provided a VCF, the identity check looks at the PL, GL and GT fields (in that order) and uses the first one that it finds.
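
The categorization rule described in this section can be sketched as a small shell function. This is an illustration of the rule as worded here, not the tool's actual code: the `classify` name and its arguments are hypothetical, and the threshold is assumed to be negative (for example -5) so that the mismatch and match ranges do not overlap.

```shell
# classify LOD THRESHOLD EXPECTED
#   LOD:       LOD score of one comparison
#   THRESHOLD: value given as LOD_THRESHOLD (assumed negative, e.g. -5)
#   EXPECTED:  "match" if the sample identifiers agree, "mismatch" otherwise
classify() {
    awk -v lod="$1" -v thr="$2" -v want="$3" 'BEGIN {
        lod += 0; thr += 0                      # force numeric comparison
        if (lod < thr)        call = "MISMATCH"
        else if (lod > -thr)  call = "MATCH"
        else { print "AMBIGUOUS"; exit }
        # Prefix depends on whether the call agrees with the expectation.
        print ((tolower(call) == want) ? "EXPECTED_" : "UNEXPECTED_") call
    }'
}

classify 6.2 -5 match       # same sample identifier, strong positive LOD
classify -12 -5 match       # same sample identifier, strong negative LOD
classify 1.5 -5 mismatch    # LOD between the two cut-offs
```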

    CrosscheckFingerprints (Picard) specific arguments

    This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table.

    Argument name(s) | Default value | Summary

    Required Arguments
    --HAPLOTYPE_MAP, -H | null | The file lists a set of SNPs, optionally arranged in high-LD blocks, to be used for fingerprinting. See https://software.broadinstitute.org/gatk/documentation/article?id=9526 for details.
    --INPUT, -I | [] | One or more input files (or lists of files) with which to compare fingerprints.

    Optional Tool Arguments
    --ALLOW_DUPLICATE_READS | false | Allow the use of duplicate reads in performing the comparison. Can be useful when duplicate marking has been overly aggressive and coverage is low.
    --arguments_file | [] | read one or more arguments files and add them to the command line
    --CALCULATE_TUMOR_AWARE_RESULTS | true | specifies whether the Tumor-aware result should be calculated. These are time consuming and can roughly double the runtime of the tool. When crosschecking many groups not calculating the tumor-aware results can result in a significant speedup.
    --CROSSCHECK_BY | READGROUP | Specifies which data-type should be used as the basic comparison unit. Fingerprints from readgroups can be "rolled-up" to the LIBRARY, SAMPLE, or FILE level before being compared. Fingerprints from VCF can be compared by SAMPLE or FILE.
    --CROSSCHECK_MODE | CHECK_SAME_SAMPLE | An argument that controls how crosschecking with both INPUT and SECOND_INPUT should occur.
    --EXIT_CODE_WHEN_MISMATCH | 1 | When one or more mismatches between groups is detected, exit with this value instead of 0.
    --EXPECT_ALL_GROUPS_TO_MATCH | false | Expect all groups' fingerprints to match, irrespective of their sample names. By default (with this value set to false), groups (readgroups, libraries, files, or samples) with different sample names are expected to mismatch, and those with the same sample name are expected to match.
    --GENOTYPING_ERROR_RATE | 0.01 | Assumed genotyping error rate that provides a floor on the probability that a genotype comes from the expected sample. Must be greater than zero.
    --help, -h | false | display the help message
    --INPUT_SAMPLE_MAP | null | A tsv with two columns representing the sample as it appears in the INPUT data (in column 1) and the sample as it should be used for comparisons to SECOND_INPUT (in the second column). Need only include the samples that change. Values in column 1 should be unique. Values in column 2 should be unique even in union with the remaining unmapped samples. Should only be used with SECOND_INPUT.
    --LOD_THRESHOLD, -LOD | 0.0 | If any two groups (with the same sample name) match with a LOD score lower than the threshold the tool will exit with a non-zero code to indicate error. Program will also exit with an error if it finds two groups with different sample name that match with a LOD score greater than -LOD_THRESHOLD. A LOD score of 0 means equal likelihood that the groups match vs. come from different individuals, a negative LOD score of -N means 10^N times more likely that the groups are from different individuals, and +N means 10^N times more likely that the groups are from the same individual.
    --LOSS_OF_HET_RATE | 0.5 | The rate at which a heterozygous genotype in a normal sample turns into a homozygous (via loss of heterozygosity) in the tumor (model assumes independent events, so this needs to be larger than reality).
    --MATRIX_OUTPUT, -MO | null | Optional output file to write matrix of LOD scores to. This is less informative than the metrics output and only contains Normal-Normal LOD score (i.e. doesn't account for Loss of Heterozygosity). It is however sometimes easier to use visually.
    --NUM_THREADS | 1 | The number of threads to use to process files and generate fingerprints.
    --OUTPUT, -O | null | Optional output file to write metrics to. Default is to write to stdout.
    --OUTPUT_ERRORS_ONLY | false | If true then only groups that do not relate to each other as expected will have their LODs reported.
    --SECOND_INPUT, -SI | [] | A second set of input files (or lists of files) with which to compare fingerprints. If this option is provided the tool compares each sample in INPUT with the sample from SECOND_INPUT that has the same sample ID. In addition, data will be grouped by SAMPLE regardless of the value of CROSSCHECK_BY. When operating in this mode, each sample in INPUT must also have a corresponding sample in SECOND_INPUT. If this is violated, the tool will proceed to check the matching samples, but report the missing samples and return a non-zero error-code.
    --SECOND_INPUT_SAMPLE_MAP | null | A tsv with two columns representing the sample as it appears in the SECOND_INPUT data (in column 1) and the sample as it should be used for comparisons to INPUT (in the second column). Need only include the samples that change. Values in column 1 should be unique. Values in column 2 should be unique even in union with the remaining unmapped samples. Should only be used with SECOND_INPUT.
    --version | false | display the version number for this tool

    Optional Common Arguments
    --COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
    --CREATE_INDEX | false | Whether to create a BAM index when writing a coordinate-sorted BAM file.
    --CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
    --GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
    --MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
    --QUIET | false | Whether to suppress job-summary info on System.err.
    --REFERENCE_SEQUENCE, -R | null | Reference sequence file.
    --TMP_DIR | [] | One or more directories with space available to be used by this program for temporary storage of working files
    --USE_JDK_DEFLATER, -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
    --USE_JDK_INFLATER, -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
    --VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
    --VERBOSITY | INFO | Control verbosity of logging.

    Advanced Arguments
    --showHidden | false | display hidden arguments

    Argument details

    Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


    --ALLOW_DUPLICATE_READS / NA

    Allow the use of duplicate reads in performing the comparison. Can be useful when duplicate marking has been overly aggressive and coverage is low.

    boolean  false


    --arguments_file / NA

    read one or more arguments files and add them to the command line

    List[File]  []


    --CALCULATE_TUMOR_AWARE_RESULTS / NA

    specifies whether the Tumor-aware result should be calculated. These are time consuming and can roughly double the runtime of the tool. When crosschecking many groups not calculating the tumor-aware results can result in a significant speedup.

    boolean  true


    --COMPRESSION_LEVEL / NA

    Compression level for all compressed files created (e.g. BAM and VCF).

    int  5  [ [ -∞  ∞ ] ]


    --CREATE_INDEX / NA

    Whether to create a BAM index when writing a coordinate-sorted BAM file.

    Boolean  false


    --CREATE_MD5_FILE / NA

    Whether to create an MD5 digest for any BAM or FASTQ files created.

    boolean  false


    --CROSSCHECK_BY / NA

    Specifies which data-type should be used as the basic comparison unit. Fingerprints from readgroups can be "rolled-up" to the LIBRARY, SAMPLE, or FILE level before being compared. Fingerprints from VCF can be compared by SAMPLE or FILE.

    The --CROSSCHECK_BY argument is an enumerated type (DataType), which can have one of the following values:

    FILE
    SAMPLE
    LIBRARY
    READGROUP

    DataType  READGROUP
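
As an illustration of choosing the roll-up level, the following sketch builds a command line for a given CROSSCHECK_BY value and any number of input files. The `crosscheck_by` wrapper, the `RUN` variable, the output naming scheme and the file names are all hypothetical; the command is only printed unless `RUN` is set to an empty value.

```shell
# Prints the command by default; set RUN= (empty) in the environment to execute it.
run=${RUN-echo}

crosscheck_by() {
    # $1: one of FILE, SAMPLE, LIBRARY, READGROUP; remaining arguments: input files
    level=$1; shift
    case "$level" in
        FILE|SAMPLE|LIBRARY|READGROUP) ;;
        *) echo "unknown CROSSCHECK_BY value: $level" >&2; return 2 ;;
    esac
    inputs=
    for f in "$@"; do inputs="$inputs INPUT=$f"; done
    # $inputs is left unquoted on purpose so it splits into one argument per
    # file (this simple sketch does not support file names containing spaces).
    $run java -jar picard.jar CrosscheckFingerprints \
        $inputs \
        HAPLOTYPE_MAP=fingerprinting_haplotype_database.txt \
        CROSSCHECK_BY="$level" \
        OUTPUT="crosscheck.by_$level.crosscheck_metrics"
}

crosscheck_by LIBRARY sample.one.bam sample.two.bam
```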


    --CROSSCHECK_MODE / NA

    An argument that controls how crosschecking with both INPUT and SECOND_INPUT should occur.

    The --CROSSCHECK_MODE argument is an enumerated type (CrosscheckMode), which can have one of the following values:

    CHECK_SAME_SAMPLE
    CHECK_ALL_OTHERS

    CrosscheckMode  CHECK_SAME_SAMPLE


    --EXIT_CODE_WHEN_MISMATCH / NA

    When one or more mismatches between groups is detected, exit with this value instead of 0.

    int  1  [ [ -∞  ∞ ] ]
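
In a pipeline script, a distinct mismatch code can make it easier to tell a fingerprint mismatch apart from an unrelated failure. The sketch below assumes the tool was run with EXIT_CODE_WHEN_MISMATCH=3 and that unrelated failures use some other non-zero code; both the value 3 and the `describe_exit` helper are hypothetical.

```shell
MISMATCH_CODE=3   # hypothetical value, passed to the tool as EXIT_CODE_WHEN_MISMATCH=3

describe_exit() {
    # $1: exit status of a CrosscheckFingerprints run
    case "$1" in
        0)                echo "groups relate as expected" ;;
        "$MISMATCH_CODE") echo "fingerprint mismatch reported" ;;
        *)                echo "other failure (exit code $1)" ;;
    esac
}

# Typical use after a run (not executed here):
#   java -jar picard.jar CrosscheckFingerprints ... EXIT_CODE_WHEN_MISMATCH="$MISMATCH_CODE"
#   describe_exit "$?"
describe_exit 3
```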


    --EXPECT_ALL_GROUPS_TO_MATCH / NA

    Expect all groups' fingerprints to match, irrespective of their sample names. By default (with this value set to false), groups (readgroups, libraries, files, or samples) with different sample names are expected to mismatch, and those with the same sample name are expected to match.

    boolean  false


    --GA4GH_CLIENT_SECRETS / NA

    Google Genomics API client_secrets.json file path.

    String  client_secrets.json


    --GENOTYPING_ERROR_RATE / NA

    Assumed genotyping error rate that provides a floor on the probability that a genotype comes from the expected sample. Must be greater than zero.

    double  0.01  [ [ -∞  ∞ ] ]


    --HAPLOTYPE_MAP / -H

    The file lists a set of SNPs, optionally arranged in high-LD blocks, to be used for fingerprinting. See https://software.broadinstitute.org/gatk/documentation/article?id=9526 for details.

    R File  null


    --help / -h

    display the help message

    boolean  false


    --INPUT / -I

    One or more input files (or lists of files) with which to compare fingerprints.

    R List[String]  []


    --INPUT_SAMPLE_MAP / NA

    A tsv with two columns representing the sample as it appears in the INPUT data (in column 1) and the sample as it should be used for comparisons to SECOND_INPUT (in the second column). Need only include the samples that change. Values in column 1 should be unique. Values in column 2 should be unique even in union with the remaining unmapped samples. Should only be used with SECOND_INPUT.

    File  null
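
A sample map of the kind described above can be written and sanity-checked with standard tools. The following is a minimal sketch with made-up sample names and a hypothetical `check_sample_map` helper. It only checks that each row has two tab-separated columns and that neither column repeats a value within the file; it does not check column 2 against the unmapped samples in the data.

```shell
# Hypothetical map: sample name in INPUT (column 1), name to use for comparison (column 2).
map=${TMPDIR:-/tmp}/input_sample_map.tsv
printf 'NA12878_run1\tNA12878\nNA12891_run1\tNA12891\n' > "$map"

check_sample_map() {
    # Exits non-zero and prints a message for each malformed row or duplicated value.
    awk -F'\t' '
        NF != 2     { printf "line %d: expected 2 columns\n", NR; bad = 1 }
        seen1[$1]++ { printf "line %d: duplicate column-1 value %s\n", NR, $1; bad = 1 }
        seen2[$2]++ { printf "line %d: duplicate column-2 value %s\n", NR, $2; bad = 1 }
        END         { exit bad }
    ' "$1"
}

check_sample_map "$map" && echo "sample map looks consistent"
```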


    --LOD_THRESHOLD / -LOD

    If any two groups (with the same sample name) match with a LOD score lower than the threshold the tool will exit with a non-zero code to indicate error. Program will also exit with an error if it finds two groups with different sample name that match with a LOD score greater than -LOD_THRESHOLD. A LOD score of 0 means equal likelihood that the groups match vs. come from different individuals, a negative LOD score of -N means 10^N times more likely that the groups are from different individuals, and +N means 10^N times more likely that the groups are from the same individual.

    double  0.0  [ [ -∞  ∞ ] ]
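
To make the base-10 scale concrete, a LOD score can be converted back into a likelihood ratio as 10 raised to the LOD. The `lod_to_ratio` helper below is a hypothetical illustration of that arithmetic, not part of the tool.

```shell
lod_to_ratio() {
    # Ratio of "same individual" to "different individuals" likelihoods: 10^LOD
    awk -v n="$1" 'BEGIN { printf "%g\n", 10 ^ n }'
}

lod_to_ratio 6     # a LOD of 6: one million to one in favour of a match
lod_to_ratio 0     # a LOD of 0: both hypotheses equally likely
lod_to_ratio -3    # a LOD of -3: one thousand to one against a match
```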


    --LOSS_OF_HET_RATE / NA

    The rate at which a heterozygous genotype in a normal sample turns into a homozygous (via loss of heterozygosity) in the tumor (model assumes independent events, so this needs to be larger than reality).

    double  0.5  [ [ -∞  ∞ ] ]


    --MATRIX_OUTPUT / -MO

    Optional output file to write matrix of LOD scores to. This is less informative than the metrics output and only contains Normal-Normal LOD score (i.e. doesn't account for Loss of Heterozygosity). It is however sometimes easier to use visually.

    Exclusion: This argument cannot be used at the same time as SECOND_INPUT.

    File  null


    --MAX_RECORDS_IN_RAM / NA

    When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.

    Integer  500000  [ [ -∞  ∞ ] ]


    --NUM_THREADS / NA

    The number of threads to use to process files and generate fingerprints.

    int  1  [ [ -∞  ∞ ] ]


    --OUTPUT / -O

    Optional output file to write metrics to. Default is to write to stdout.

    File  null
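
Because the metrics file has one row per comparison, it is often convenient to pull out only the unexpected results. The sketch below assumes the usual Picard metrics layout (comment lines starting with `#`, a blank line, then a tab-separated header row) and a column named RESULT; the toy file, its four columns and its values are made up for illustration, and a real file carries more columns.

```shell
# Toy metrics file with made-up values, standing in for real tool output.
metrics=${TMPDIR:-/tmp}/toy.crosscheck_metrics
{
    printf '## htsjdk.samtools.metrics.StringHeader\n'
    printf '# CrosscheckFingerprints (toy header line)\n'
    printf '\n'
    printf '## METRICS CLASS\tpicard.fingerprint.CrosscheckMetric\n'
    printf 'LEFT_GROUP_VALUE\tRIGHT_GROUP_VALUE\tRESULT\tLOD_SCORE\n'
    printf 'rg1\trg2\tEXPECTED_MATCH\t12.3\n'
    printf 'rg1\trg3\tUNEXPECTED_MISMATCH\t-20.1\n'
} > "$metrics"

unexpected_rows() {
    awk -F'\t' '
        /^#/ || NF == 0 { next }          # skip header comments and blank lines
        !col {                            # first remaining line is the column header
            for (i = 1; i <= NF; i++) if ($i == "RESULT") col = i
            if (!col) exit 2              # no RESULT column found
            next
        }
        $col ~ /^UNEXPECTED_/             # print rows whose RESULT is unexpected
    ' "$1"
}

unexpected_rows "$metrics"
```

Setting OUTPUT_ERRORS_ONLY=true is an alternative way to restrict the report to groups that do not relate as expected.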


    --OUTPUT_ERRORS_ONLY / NA

    If true then only groups that do not relate to each other as expected will have their LODs reported.

    boolean  false


    --QUIET / NA

    Whether to suppress job-summary info on System.err.

    Boolean  false


    --REFERENCE_SEQUENCE / -R

    Reference sequence file.

    File  null


    --SECOND_INPUT / -SI

    A second set of input files (or lists of files) with which to compare fingerprints. If this option is provided the tool compares each sample in INPUT with the sample from SECOND_INPUT that has the same sample ID. In addition, data will be grouped by SAMPLE regardless of the value of CROSSCHECK_BY. When operating in this mode, each sample in INPUT must also have a corresponding sample in SECOND_INPUT. If this is violated, the tool will proceed to check the matching samples, but report the missing samples and return a non-zero error-code.

    Exclusion: This argument cannot be used at the same time as MATRIX_OUTPUT, MO.

    List[String]  []


    --SECOND_INPUT_SAMPLE_MAP / NA

    A tsv with two columns representing the sample as it appears in the SECOND_INPUT data (in column 1) and the sample as it should be used for comparisons to INPUT (in the second column). Need only include the samples that change. Values in column 1 should be unique. Values in column 2 should be unique even in union with the remaining unmapped samples. Should only be used with SECOND_INPUT.

    File  null


    --showHidden / -showHidden

    display hidden arguments

    boolean  false


    --TMP_DIR / NA

    One or more directories with space available to be used by this program for temporary storage of working files

    List[File]  []


    --USE_JDK_DEFLATER / -use_jdk_deflater

    Use the JDK Deflater instead of the Intel Deflater for writing compressed output

    Boolean  false


    --USE_JDK_INFLATER / -use_jdk_inflater

    Use the JDK Inflater instead of the Intel Inflater for reading compressed input

    Boolean  false


    --VALIDATION_STRINGENCY / NA

    Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

    The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

    STRICT
    LENIENT
    SILENT

    ValidationStringency  STRICT


    --VERBOSITY / NA

    Control verbosity of logging.

    The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:

    ERROR
    WARNING
    INFO
    DEBUG

    LogLevel  INFO


    --version / NA

    display the version number for this tool

    boolean  false



    GATK version 4.1.0.0 built at Wed, 30 Jan 2019 10:21:04 +0530.