The GenerateCNVHaplotypes utility can be used to construct VCF records that represent complex structural haplotypes.
This is a special-purpose utility that allows the analysis of complex CNV regions (such as the C4 locus) by combining multiple copy-number variable segments, and potentially other markers, into structural haplotypes encoded in VCF format. The output is intended to be processed further, for example by genotype refinement using beagle.
The primary inputs to this utility are:
- a VCF file containing CNVs with haploid copy number likelihoods (for example as produced by GenerateHaploidCNVGenotypes)
- a list of sites in the input VCF that should be merged
- a list of the structural haplotype alleles to use (see below)
- an optional file indicating samples that should be assigned to particular labeled haplotypes (see below)
The output is a VCF file containing a new VCF record based on the input structural alleles, with genotype likelihoods and genotype estimates for all of the samples in the input VCF. The genotype likelihoods in the output VCF record are based on the likelihoods of the component copy number sites from the input VCF. The output file is intended for further downstream processing, particularly genotype refinement using beagle to convert the output likelihoods into hardened genotypes based on the surrounding SNP haplotypes.
The (single) output site in the output VCF is labeled with the ID given by -outputSiteId. The interval of the output site can be specified using the -outputSiteInterval argument. If it is not supplied, the output interval is set to the union of the intervals of all of the input sites listed by the -site argument. Currently, this utility only supports emitting one combined output site at a time.
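The default interval rule can be illustrated with a minimal sketch. It assumes that "union" means the smallest single interval spanning all input sites on one chromosome, which the text above does not spell out. The helper `spanning_interval` and the coordinates are hypothetical and are not part of the utility.

```python
def spanning_interval(intervals):
    """Return the smallest interval covering all (chrom, start, end) tuples.

    Assumption: all input sites lie on the same chromosome and "union"
    means the enclosing span. This is an illustration, not the tool's code.
    """
    chroms = {chrom for chrom, _, _ in intervals}
    if len(chroms) != 1:
        raise ValueError("all sites must be on one chromosome")
    start = min(s for _, s, _ in intervals)
    end = max(e for _, _, e in intervals)
    return (chroms.pop(), start, end)


# Illustrative coordinates only (not real site locations).
sites = [("6", 1000, 2000), ("6", 1500, 3200), ("6", 2500, 2600)]
print(spanning_interval(sites))
```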
The purpose of this utility is to convert a set of diploid CNV observations (with likelihood estimates) for a set of samples into a set of likelihood estimates for a set of structural haplotypes, to be then used for statistical genotype refinement. The list of target haplotypes is currently specified externally as described below.
This utility uses a special notation convention to describe structural haplotype alleles. If the number of CNV sites being merged is K, then each allele is represented as a symbolic alternate allele of the form <H_n_n_n_L> where there are K integers following the H (separated by underscores) plus an optional alphabetic label L following these integers and separated by an underscore. The integers represent the haploid copy number at each of the K input sites. The label L can be used to distinguish two or more potentially recurrent haplotypes with the same structure.
For example, at the C4 gene locus, we might distinguish two forms of C4 (C4A and C4B) based on the amino acid sequence at a specific binding site in exon 26. In addition, we might label each copy of C4 as either short or long, depending on the presence of a polymorphic endogenous retrovirus (HERV) insertion in intron 9. If we want to represent the copy number state of a C4 structural haplotype as the number of C4A copies, the number of C4B copies, and the number of HERV copies, we could encode this as haplotypes with K=3. Using this encoding, an example allele of the form <H_1_1_1> would indicate a haplotype with one copy of C4A, one copy of C4B and one copy of the HERV.
The labels are used when two different alleles have the same structural form but you want to distinguish them (for example, because they are independent alleles that have arisen on different haplotype backgrounds). In this case you could label them, for example, as <H_1_1_1_A> and <H_1_1_1_B>. This utility treats alleles of this form as structurally identical but distinct alleles.
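To make the notation concrete, here is a small sketch of how an allele name could be split into its copy numbers and optional label. The function `parse_allele` and the regular expression are illustrative assumptions based only on the description above. They are not taken from the utility's source.

```python
import re

# Pattern for the allele notation described above: H, then one or more
# underscore-separated integers, then an optional alphabetic label.
ALLELE_RE = re.compile(r"^H((?:_\d+)+)(?:_([A-Za-z]+))?$")


def parse_allele(text):
    """Parse 'H_1_1_1_A' or '<H_1_1_1_A>' into ((1, 1, 1), 'A')."""
    name = text.strip()
    if name.startswith("<") and name.endswith(">"):
        name = name[1:-1]
    match = ALLELE_RE.match(name)
    if match is None:
        raise ValueError(f"not a structural haplotype allele: {text!r}")
    # group(1) looks like '_1_1_1'; drop the empty first field after split.
    copy_numbers = tuple(int(n) for n in match.group(1).split("_")[1:])
    return copy_numbers, match.group(2)


a = parse_allele("<H_1_1_1_A>")
b = parse_allele("<H_1_1_1_B>")
# Same structure (copy numbers), different labels: distinct alleles.
print(a[0] == b[0], a[1] != b[1])
```

Under this reading, two alleles are structurally identical when their copy-number tuples are equal, and the label alone distinguishes them.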
The list of alleles to use to generate the output likelihoods is given by the -haplotypeFile parameter. There should be one allele per line. The syntax should match the description above (but should not include the angle brackets).
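As a hedged illustration of the file format, the sketch below checks the lines of a haplotype file before it is passed to -haplotypeFile. The function `check_haplotype_lines`, the example alleles, and the decision to skip blank lines are assumptions made for this example. The utility's own validation may differ.

```python
def check_haplotype_lines(lines, n_sites):
    """Return the allele names, raising ValueError on a malformed line.

    n_sites is K, the number of CNV sites being merged.
    """
    alleles = []
    for line_number, line in enumerate(lines, start=1):
        name = line.strip()
        if not name:
            continue  # skipping blank lines is an assumption of this sketch
        if name.startswith("<") or name.endswith(">"):
            raise ValueError(f"line {line_number}: omit the angle brackets")
        fields = name.split("_")
        if fields[0] != "H":
            raise ValueError(f"line {line_number}: allele must start with H")
        # An optional trailing alphabetic field is the label.
        has_label = len(fields) > 2 and fields[-1].isalpha()
        counts = fields[1:-1] if has_label else fields[1:]
        if len(counts) != n_sites or not all(c.isdigit() for c in counts):
            raise ValueError(f"line {line_number}: expected {n_sites} integers")
        alleles.append(name)
    return alleles


# Illustrative file content for K=3 (made-up alleles, one per line).
example_lines = ["H_1_1_1_A", "H_1_1_1_B", "H_2_1_2", "H_1_0_1"]
print(check_haplotype_lines(example_lines, n_sites=3))
```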
Given a set of input structural alleles, the likelihood of each sample to carry each combination of alleles is computed based on the input likelihoods for the CNV sites specified in the input VCF. Genotype likelihoods are emitted for all of the allelic combinations. The GT field is set to the most likely genotype combination. GQ is computed based on the likelihood ratio of the most-likely and next most-likely allele combinations. If there are structurally identical input alleles (distinguished by labels), then the likelihoods for this structural state are divided among these labeled alleles, unless a sample is assigned to one particular labeled allele (see the -sampleHaplotypeFile argument).
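The following sketch shows one way to enumerate the diploid allele combinations and pick the most likely one. It assumes the standard VCF ordering of diploid genotype likelihoods, in which genotype j/k with j <= k sits at index k*(k+1)/2 + j. It also reports the quality margin as a raw log10 likelihood ratio, because the text above does not state the scaling used for GQ. The likelihood values are made up for illustration.

```python
def diploid_genotypes(n_alleles):
    """List diploid genotypes (j, k), j <= k, in VCF GL order."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]


def call_genotype(log10_likelihoods):
    """Return (best_index, log10 ratio of best to second-best)."""
    ranked = sorted(range(len(log10_likelihoods)),
                    key=lambda i: log10_likelihoods[i], reverse=True)
    best, second = ranked[0], ranked[1]
    return best, log10_likelihoods[best] - log10_likelihoods[second]


# Allele 0 is the reference (N); alleles 1 and 2 stand for two
# structural alleles from the haplotype file.
genotypes = diploid_genotypes(3)
# Made-up log10 likelihoods, one per genotype in the order above.
# Combinations involving the reference allele are set to -1000.
gls = [-1000.0, -1000.0, -2.5, -1000.0, -0.1, -1.25]
best, margin = call_genotype(gls)
print(genotypes[best], margin)
```

With N alleles there are N*(N+1)/2 diploid combinations, so the length of the likelihood vector grows quadratically with the number of alleles listed in the haplotype file.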
If the optional -sampleHaplotypeFile argument is supplied, it should be a two-column tab-delimited file with header SAMPLES HAPLOTYPES. The first column should be a sample ID present in the input VCF. The second column should be either one or two alleles, separated by a comma if two are present. The alleles should be in the same format as -haplotypeFile. The purpose of this argument is to allow certain individuals or haplotypes to act as "seeds" for labeled alleles that are otherwise structurally identical. Normally, the likelihoods for all labeled alleles with the same structure are set to be equal (based on the overall likelihood of that structure), but when a sample is present in the input file, its likelihood for that allele will be increased by a value specified by -sampleHaplotypePriorLikelihood (log scale, default 3.0, i.e. a prior of 10^3). As an example, if you have two labeled alleles <H_1_1_1_A> and <H_1_1_1_B>, then the likelihoods of these alleles in the output file will be weighted equally in each sample, unless a sample is specified in this input file to raise its prior on either <H_1_1_1_A> or <H_1_1_1_B> (or both).
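A minimal sketch of reading such a file is shown below. The sample IDs are hypothetical, and `read_sample_haplotypes` and `apply_prior` are illustrative helpers rather than the utility's code. The sketch also assumes that "increased by" means adding the prior to a log10 likelihood, which is equivalent to multiplying the likelihood by 10^3 at the default setting.

```python
import csv
import io


def read_sample_haplotypes(handle):
    """Map each sample ID to its list of one or two allele names."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header != ["SAMPLES", "HAPLOTYPES"]:
        raise ValueError(f"unexpected header: {header!r}")
    assignments = {}
    for row in reader:
        if not row:
            continue  # skipping blank lines is an assumption of this sketch
        sample, alleles = row
        names = alleles.split(",")
        if len(names) not in (1, 2):
            raise ValueError(f"{sample}: expected one or two alleles")
        assignments[sample] = names
    return assignments


def apply_prior(log10_likelihood, prior=3.0):
    """Raise a log10 likelihood by the prior (assumed to be additive)."""
    return log10_likelihood + prior


# Hypothetical file content: SAMPLE1 seeds one labeled allele,
# SAMPLE2 seeds both.
example = io.StringIO(
    "SAMPLES\tHAPLOTYPES\n"
    "SAMPLE1\tH_1_1_1_A\n"
    "SAMPLE2\tH_1_1_1_A,H_1_1_1_B\n"
)
assignments = read_sample_haplotypes(example)
print(assignments)
print(apply_prior(-4.0))
```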
The reference allele is set to N and the likelihood of the reference allele is set to be extremely unlikely (1e-1000). You can override the reference allele likelihood with the -defaultLogLikelihood parameter.
If supplied, the optional -unknownHaplotypeLikelihood parameter will cause an extra "unknown" allele <UNK> to be emitted. The likelihoods for the unknown allele will be assigned based on the parameter value, which is in log10 scale (e.g. -50 means 1e-50). This unknown allele is intended to capture the likelihood of one or more other alleles not present in the input list. Use of this parameter is discouraged, but the parameter is retained for backwards compatibility.
java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sv.apps.GenerateCNVHaplotypes \
    -R reference.fasta \
    -vcf input.haploid.vcf.gz \
    -O output.haplotypes.vcf.gz \
    -site input_site_list.list \
    -haplotypeFile input_haplotypes.txt \
    -outputSiteId site_id \
    -sampleHaplotypeFile sample_haplotypes.txt
GenerateCNVHaplotypes specific arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| -vcf | File | NA | Input file of haploid CNV allele likelihoods |
| -haplotypeFile | File | NA | Input file of possible haplotype alleles |
| -R | File | NA | Reference fasta file |
| -O | File | NA | Output vcf file which will contain the emitted haplotype site record |
| -outputSiteId | String | NA | Site ID to give to the output site |
| -site | List[String] | NA | Ordered list (or .list file) of CNV site IDs to be included in the haplotypes |
| -log | String | NA | Set the logging location |
| -defaultLogLikelihood | Double | -1000.0 | Default log likelihood to use when no GLs are present, in particular for the unused reference allele (log10) |
| -genderMapFile | List[File] | NA | Map file or files containing the gender for each sample |
| -l | String | INFO | Set the minimum level of logging |
| -debug | String | NA | Produce verbose debugging output (default false) |
| -warnOnMissingCopyNumberAlleles | String | NA | Produce warnings if referenced copy number alleles are missing (default false) |
| -outputSiteInterval | String | NA | Interval to assign to the output record (default is the union of the input sites) |
| -ploidyMapFile | File | NA | Ploidy map specifying gender-dependent ploidy for each region of the reference |
| -sample | List[String] | NA | Sample(s) or .list file of samples to process |
| -sampleHaplotypeFile | File | NA | File giving priors on samples carrying specific alleles from structurally equivalent recurrent alleles |
| -sampleHaplotypePriorLikelihood | Double | 3.0 | Default prior for a sample that has a haplotype assignment (log10) |
| -unknownHaplotypeLikelihood | Double | NA | Generate an UNK (unknown) haplotype with the given log10 likelihood |
| -h | Flag | NA | Generate the help message |
| -version | Flag | NA | Output version information |