D-ToxoG

What is D-ToxoG?

D-ToxoG is a tool for removing the OxoG artifact from a set of SNV calls.

How does D-ToxoG work?

The goal of D-ToxoG is to limit OxoG artifacts to less than 1% of the output mutation calls.

Steps:

  1. Make an estimate of the number of OxoG artifacts in a given MAF file
  2. Calculate the likelihood that a given SNV is an OxoG artifact
  3. Apply a threshold (0.01) to the false discovery rate

 

Costello, M., Pugh, T. J., Fennell, T. J., Stewart, C., Lichtenstein, L., Meldrim, J. C., et al. (2013). Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Research, 41(6), e67. http://doi.org/10.1093/nar/gks1443

How do I get D-ToxoG?

The D-ToxoG MATLAB scripts are available here (ZIP, 50 KB).

Running D-ToxoG

Prerequisites

  • MATLAB 2012a or above.  Earlier versions may work, but have not been tested.
  • An input MAF file with the columns listed below
  • The D-ToxoG MATLAB scripts

Input

The input is an SNV MAF file with the columns below.  Column names are case sensitive.

  1. Chromosome -- Contig number without a prefix.  E.g. "3" or "X" (without quotes)
  2. Start_position -- Position of the SNV
  3. End_position -- Should be the same as Start_position for SNV
  4. Reference_Allele -- The allele found in the reference genome at this position.  Single character representing the base, e.g. "G".
  5. Tumor_Seq_Allele1 -- The alternate allele.  Single character representing the base, e.g. "T".
  6. Tumor_Sample_Barcode -- name of the tumor (or "case") sample.  This is used to generate file names and plots.
  7. Matched_Norm_Sample_Barcode -- name of the normal (or "control") sample.  This is used to generate file names and plots.
  8. ref_context -- Small window of the reference sequence centered on the SNV.  The center position should be the same as Reference_Allele.  The total string should be of odd length and have a minimum length of 3.
    For example: if Reference_Allele is G, Chromosome is 1, and Start_position and End_position are both 120906037, then ref_context is CTTTTTTCGCGCAAAAATGCC  (string length is 21, in this case)
  9. i_t_ALT_F1R2 -- the number of reads with pair orientation of F1R2 and with the alternate allele (Tumor_Seq_Allele1).
  10. i_t_ALT_F2R1 -- the number of reads with pair orientation of F2R1 and with the alternate allele (Tumor_Seq_Allele1).
  11. i_t_REF_F1R2 -- the number of reads with pair orientation of F1R2 and with the reference allele (Reference_Allele).
  12. i_t_REF_F2R1 -- the number of reads with pair orientation of F2R1 and with the reference allele (Reference_Allele).
  13. i_t_Foxog -- Foxog, as described in the methods.  Depending on the nature of the reference and alternate alleles, either i_t_ALT_F1R2/(i_t_ALT_F1R2 + i_t_ALT_F2R1)  or i_t_ALT_F2R1/(i_t_ALT_F1R2 + i_t_ALT_F2R1).
    1. C>anything:  numerator is i_t_ALT_F2R1
    2. A>anything:  numerator is i_t_ALT_F2R1
    3. G>anything:  numerator is i_t_ALT_F1R2
    4. T>anything:  numerator is i_t_ALT_F1R2
  14. Variant_Type -- "SNP" (without quotes)
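
The i_t_Foxog and ref_context rules above can be checked before running the filter. The following is a minimal Python sketch, not part of the D-ToxoG MATLAB scripts; the function names foxog and context_matches are hypothetical, and returning NaN when there are no alternate reads is an assumption, since the text above does not say how that case is handled.

```python
import math

# Which ALT read-pair count goes in the numerator, keyed by reference allele
# (taken from the i_t_Foxog sub-list above).
NUMERATOR_BY_REF = {"C": "F2R1", "A": "F2R1", "G": "F1R2", "T": "F1R2"}


def foxog(ref_allele, alt_f1r2, alt_f2r1):
    """Return i_t_Foxog for one SNV, or NaN when there are no ALT reads."""
    total = alt_f1r2 + alt_f2r1
    if total == 0:
        # Assumption: an undefined fraction is reported as NaN.
        return math.nan
    orientation = NUMERATOR_BY_REF[ref_allele.upper()]
    numerator = alt_f1r2 if orientation == "F1R2" else alt_f2r1
    return numerator / total


def context_matches(ref_allele, ref_context):
    """Check that ref_context has odd length >= 3 and is centered on ref_allele."""
    n = len(ref_context)
    return n >= 3 and n % 2 == 1 and ref_context[n // 2] == ref_allele
```

For instance, foxog("G", 9, 1) uses the F1R2 count as the numerator, while foxog("C", 9, 1) uses the F2R1 count, and context_matches("G", "CTTTTTTCGCGCAAAAATGCC") checks the example given for ref_context.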

Output

The filter generates many intermediate files and figures; see the directory specified when running the filter.

The main outputs are two MAF files: one containing all input calls, and another with all artifact calls removed (i.e. no lines where oxoGCut = 1).  Both MAF files have the following columns added:

  1. pox --  p-value that the call is actually an artifact
  2. qox --  false discovery rate score.
  3. pox_cutoff --  minimum pox score for a call to be marked as an artifact.
  4. isArtifactMode:
    1. Variant is C>A, G>T: 1
    2. Variant is not C>A or G>T: 0
  5. oxoGCut:
    1. Variant is marked as artifact: 1
    2. Variant is not an artifact: 0
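
As an illustration of how the added columns relate to the two output files, here is a minimal Python sketch, separate from the D-ToxoG MATLAB scripts. It assumes the annotated MAF is tab-separated with a single header row, that comment lines (if any) start with "#", and that oxoGCut holds a numeric value; the function names is_artifact_mode and drop_artifacts are hypothetical.

```python
import csv


def is_artifact_mode(ref_allele, alt_allele):
    """Return 1 for C>A or G>T variants, else 0 (the isArtifactMode rule above)."""
    return 1 if (ref_allele, alt_allele) in {("C", "A"), ("G", "T")} else 0


def drop_artifacts(maf_text):
    """Return the rows of a tab-separated MAF whose oxoGCut is not 1."""
    # Assumption: blank lines and lines starting with "#" are not data rows.
    lines = [ln for ln in maf_text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    # Assumption: oxoGCut is numeric ("1", "0", "1.0", ...).
    return [row for row in reader if float(row["oxoGCut"]) != 1]
```

Applied to the MAF that contains all input calls, drop_artifacts should leave the same rows as the second output file, under the assumptions stated above.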
       

Usage

To execute D-ToxoG, run startFilterMAFFile.

The latest usage instructions can be found by running help startFilterMAFFile.

Example

% Filter an input MAF and write the result to a pass.maf.annotated file.  Use a mat file if available to speed loading of the MAF file.  Generate plots and use the standard PoxoG of 0.96.  Take the defaults for the rest of the parameters.  Put all outputs into the results directory.

startFilterMAFFile('C:\Lee\work\oxoGv3Results\PR_TCGA_HNSC_Capture.maf.annotated', 'PR_TCGA_HNSC_Capture.pass.maf.annotated', 'results/', 1, 1, '0.96')

Background

Experiments carried out by the sequencing platform have established that the OxoG artifact arises from an oxidation process in library construction. Prior to the PCR step, conversion of guanine to 8-oxoguanine (8-oxoG) tends to occur in the context of CGG (where the 8-oxoG is the middle G) sequences. Unlike regular guanine, 8-oxoG has a higher chemical affinity to bind with adenine rather than cytosine. The presence of 8-oxoG during subsequent PCR amplification cycles introduces adenine bases into DNA molecules at sites where cytosine should have been, systematically producing CAG sequences where the original sequence was CCG. Unlike natural mutations, non-reference artifact bases are locked to specific strand orientations by the forked adapter ligation step before PCR, resulting in G>T artifacts on the F1R2 orientation and C>A artifacts on the F2R1 orientation. An IGV screenshot of an artifact mutation from sample BLCA_A2C5 is shown below.

Above is the distribution of orientation biases from a test set of 8 tumor types (40 samples with varying degrees of artifact). The red spike close to FoxoG=1 is the artifact component.