The haplotype map that certain Picard tools require is a file that maps SNPs to LD (linkage disequilibrium) blocks. These tools include Picard CrosscheckReadGroupFingerprints and CheckFingerprint. For these tools, the HAPLOTYPE_MAP parameter defines the file.
To view the javadoc documentation for tools within the Picard Jar, type
java -jar picard.jar <tool name> -h
As of this writing (5/5/2017, Picard v2.9.0), the HAPLOTYPE_MAP file is a tab-separated text file. In a future release of Picard, this parameter will also accept VCF formats ending in .bcf. At that time, tools will interpret all other file extensions for this parameter as the original text-based format.
These two formats differ in their requirements as we outline below.
The original text-based format has a header and a body.
The header is a standard SAM header, with an @HD line to define the file type and @SQ lines to define the reference contigs. You can easily derive such a header from your reference dictionary file.
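As a rough illustration of deriving the header from a reference dictionary, the sketch below keeps only the @HD and @SQ lines of a sequence dictionary. The function name `haplotype_map_header` and the example dictionary lines are hypothetical, made up for this illustration; a real .dict file would be read from disk instead.

```python
def haplotype_map_header(dict_lines):
    """Keep only the @HD and @SQ lines of a sequence dictionary.

    dict_lines is any iterable of lines, e.g. an open .dict file.
    """
    return [
        line.rstrip("\n")
        for line in dict_lines
        if line.startswith(("@HD", "@SQ"))
    ]


# Illustrative dictionary content (not taken from a real file).
example_dict = [
    "@HD\tVN:1.5\tSO:unsorted\n",
    "@SQ\tSN:chr1\tLN:248956422\n",
    "@SQ\tSN:chr2\tLN:242193529\n",
    "@CO\tany other line type is dropped\n",
]

header = haplotype_map_header(example_dict)
print("\n".join(header))
```

With a dictionary on disk, the same function could be called as `haplotype_map_header(open("reference.dict"))`, where `reference.dict` is a placeholder file name.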
The body contains a column header line starting with a # hash, followed by lines that annotate SNPs and blocks in high LD.
SNPs listed with the same ANCHOR_SNP are in the same haplotype. If the MAFs (minor allele frequencies) within a block disagree, the tool uses the MAF of the first SNP, i.e. the one with the smallest genomic position, as the MAF of the block.
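The grouping rule above can be sketched in a few lines of Python. This is a minimal illustration, not Picard's own parser. Only ANCHOR_SNP and MAF are named in the text; the other column names, the example rows, and the convention that an empty ANCHOR_SNP means the SNP anchors its own block are all assumptions.

```python
import csv
import io
from collections import defaultdict

# Hypothetical body. The column layout is assumed for illustration only.
EXAMPLE_BODY = (
    "#CHROMOSOME\tPOSITION\tNAME\tMAJOR_ALLELE\tMINOR_ALLELE\tMAF\tANCHOR_SNP\tPANELS\n"
    "chr1\t1000\trsA\tA\tG\t0.30\t\t\n"
    "chr1\t1500\trsB\tC\tT\t0.28\trsA\t\n"
    "chr1\t9000\trsC\tG\tA\t0.10\t\t\n"
)


def block_mafs(body_text):
    """Group SNPs by anchor and take each block's MAF from its first SNP."""
    # Drop the leading '#' so the column header line can name the fields.
    reader = csv.DictReader(io.StringIO(body_text.lstrip("#")), delimiter="\t")
    blocks = defaultdict(list)
    for row in reader:
        # Assumption: an empty ANCHOR_SNP means the SNP is its own anchor.
        anchor = row["ANCHOR_SNP"] or row["NAME"]
        blocks[anchor].append((int(row["POSITION"]), row["NAME"], float(row["MAF"])))

    result = {}
    for anchor, snps in blocks.items():
        snps.sort()  # order by genomic position within the block
        result[anchor] = {
            "snps": [name for _, name, _ in snps],
            "maf": snps[0][2],  # MAF of the SNP with the smallest position
        }
    return result


blocks = block_mafs(EXAMPLE_BODY)
for anchor, info in blocks.items():
    print(anchor, info["snps"], info["maf"])
```

In this made-up example, rsA and rsB share a block whose MAF is taken from rsA, the SNP at the smaller position, even though rsB reports a slightly different value.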
Picard v2.10.1+ (released 2017/7/11) accepts this format. Tools will recognize a VCF format if the file extension ends in .bcf. Tools will interpret all other file extensions as the original text-based format we describe above.
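The extension rule amounts to a simple dispatch on the file name, sketched below. This is not Picard's code. The function name `haplotype_map_format` is hypothetical, and of the extensions listed only .bcf is stated in the text; .vcf and .vcf.gz are included as an assumption about which other VCF extensions would be recognized.

```python
# Only ".bcf" comes from the text above; ".vcf" and ".vcf.gz" are assumptions.
VCF_EXTENSIONS = (".vcf", ".vcf.gz", ".bcf")


def haplotype_map_format(path, vcf_extensions=VCF_EXTENSIONS):
    """Return 'vcf' if the file name ends in a VCF extension, else 'text'."""
    return "vcf" if path.endswith(tuple(vcf_extensions)) else "text"


for name in ("haplotype_map.bcf", "haplotype_map.txt", "haplotype_map"):
    print(name, "->", haplotype_map_format(name))
```

Any extension that is not in the list, including no extension at all, falls through to the original text-based format.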
Click here to download an example file. Here is the body portion of this example file.
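The VCF format indicates SNPs that belong to the same LD block with phased genotypes (the pipe symbol, |) and the PS (phase set) format field annotation.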
Finally, the VCF specification (v4.2) defines the PS field as follows.
PS : phase set. A phase set is defined as a set of phased genotypes to which this genotype belongs. Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set. A phase set specifies multi-marker haplotypes for the phased genotypes in the set. All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. If the genotype in the GT field is unphased, the corresponding PS field is ignored. The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). (Non-negative 32-bit Integer)
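To make the PS definition concrete, here is a minimal sketch that groups the records of a single-sample VCF by phase set. It follows the quoted definition only and makes no claim about how Picard reads the file. The example records, the sample name, and the function name `phase_sets` are hypothetical.

```python
from collections import defaultdict

# Hypothetical single-sample VCF content, for illustration only.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n"
    "chr1\t1000\trsA\tA\tG\t.\t.\t.\tGT:PS\t0|1:1000\n"
    "chr1\t1500\trsB\tC\tT\t.\t.\t.\tGT:PS\t0|1:1000\n"
    "chr1\t9000\trsC\tG\tA\t.\t.\t.\tGT:PS\t0|1:9000\n"
    "chr1\t9500\trsD\tT\tC\t.\t.\t.\tGT:PS\t0/1:9000\n"
)


def phase_sets(vcf_text):
    """Map (chromosome, PS) to the IDs of the phased records in that set."""
    sets = defaultdict(list)
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip meta-information and the column header line
        fields = line.split("\t")
        chrom, snp_id = fields[0], fields[2]
        # Pair the FORMAT keys with the first sample's values.
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        if "|" not in sample.get("GT", ""):
            continue  # unphased genotype: its PS field is ignored
        # Phased genotypes without a PS subfield share one set (key None).
        ps = int(sample["PS"]) if "PS" in sample else None
        sets[(chrom, ps)].append(snp_id)
    return dict(sets)


groups = phase_sets(EXAMPLE_VCF)
for key, ids in groups.items():
    print(key, ids)
```

In this example rsA and rsB fall into the same phase set, rsC forms its own set, and rsD is left out because its genotype is unphased. The PS values follow the recommended convention of using the position of the first variant in the set.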