ABSOLUTE (v2) BETA

This module is currently in beta release. The module and/or documentation may be incomplete.

Extracts absolute copy numbers per cancer cell from a mixed DNA population. Use this module for the per-sample processing step in the workflow.

Author: Scott Carter, Matthew Meyerson, Gad Getz

Contact:

Algorithm Version: ABSOLUTE 1.0.6

Summary 

ABSOLUTE provides various models of tumor cell purity and ploidy for subsequent manual solution selection. It infers multiple models of purity, malignant cell ploidy and absolute somatic copy numbers from copy-ratio data. It determines possible models for absolute copy numbers per cancer cell from a mixed DNA population and gives copy numbers for genomic segments and, if mutation data are provided, for mutated alleles. Using homologue-specific copy ratio (HSCR) data reduces the ambiguity of copy profiles compared to using total copy-ratio data, e.g. from comparative genomic hybridization (CGH) or low-pass sequencing. GenePattern's ABSOLUTE.summarize module compiles results from multiple ABSOLUTE runs into a format that facilitates manual solution selection. Manual review is necessary because, for a given tumor, the highest-scoring model is not always the best solution. Manually selected solutions are then provided to ABSOLUTE.review for finalized results.

Background

Elucidation of the sequence of the multiple genomic events that give rise to tumorigenesis is an ongoing area of research. Genomic events include functional mutations, genomic rearrangements including translocations and chromothripsis, gene conversion or loss of heterozygosity (LOH), and somatic copy number alterations (SCNAs) that range from regional and chromosomal amplifications and deletions to whole genome duplications (Burrell et al.). SCNAs can lead to gene dosage changes impacting phenotype; SCNAs and copy neutral LOH events at heterozygous or mutant loci can lead to unequal dose contributions of one allele over the other. 

Current models calculate somatic alterations in units of genomes or DNA mass and are interpreted in the context of a tumor's purity and overall ploidy. However, to compare across samples, copy numbers should be measured in copies per cancer cell. Absolute copy numbers could be inferred by normalizing relative data on cytological measurements of DNA mass per cell or on single-cell sequencing data. Alternatively, ABSOLUTE can be used to mathematically model solutions of tumor cell purity and ploidy.
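
As a minimal sketch of why purity and ploidy are needed to interpret relative data, the snippet below uses the mixture relation described in Carter et al. (2012): the copy ratio expected for a segment is R = (αq + 2(1 − α)) / D, with D = ατ + 2(1 − α), where α is tumor purity, τ is mean tumor ploidy and q is the copy number per cancer cell. The function names are illustrative and are not part of the ABSOLUTE module.

```python
def expected_copy_ratio(q, purity, ploidy):
    """Copy ratio expected for a segment with q copies per cancer cell."""
    # Average ploidy of the mixed sample: tumor cells plus diploid normal cells.
    sample_ploidy = purity * ploidy + 2.0 * (1.0 - purity)
    return (purity * q + 2.0 * (1.0 - purity)) / sample_ploidy


def absolute_copy_number(ratio, purity, ploidy):
    """Invert the relation: copies per cancer cell implied by a copy ratio."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    sample_ploidy = purity * ploidy + 2.0 * (1.0 - purity)
    return (ratio * sample_ploidy - 2.0 * (1.0 - purity)) / purity


if __name__ == "__main__":
    # The same observed ratio maps to different copy numbers under
    # different candidate purity/ploidy pairs, hence the need to rank models.
    observed = 1.25
    for purity, ploidy in [(0.5, 2.0), (0.5, 4.0)]:
        q = absolute_copy_number(observed, purity, ploidy)
        print(f"purity={purity}, ploidy={ploidy}: q={q:.2f}")
```

Candidate solutions whose implied copy numbers fall close to integers across most segments are the plausible ones; ABSOLUTE's actual scoring is more involved than this inversion.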

Inferring absolute copy numbers and ranking solutions depend on three factors: (1) sample heterogeneity from copy ratios and mutation data, (2) karyotype models from a reference panel built into ABSOLUTE algorithm v1.0.6, and (3) allelic fraction from mutation data. Providing mutation data, though optional, is recommended.

(1) Sample heterogeneity. Samples are heterogeneous at two tiers. (i) Tumor purity is the fraction of tumor cells in a sample, which is nearly always contaminated with normal cells, e.g. normal tissue and blood cells. Normal cells are diploid (2N) and are further identified by their normal genotype. (ii) Tumor cell heterogeneity, if any, arises from polygenomic populations, either segregated or intermixed, due to ongoing subclonal evolution. Each tumor population is grouped by ploidy, which is defined in units of normal haploid genomes for genomic segments. Segments are defined beforehand as regions of equal copy ratio.

To validate purity estimates, the authors compared calculated and histological purity estimates with methylation signatures characteristic of leukocytes, given that blood is a common sample contaminant.

(2) Karyotype models. Copy-ratio data provide a number of putative integer-valued ploidy solutions, from which purity is then inferred. In the first two charts above, solutions are shown in different colors (circles and bars). To better rank these solutions, ABSOLUTE refers to external data in the form of karyotype models. These mixture models of recurrent cancer karyotypes were bootstrapped from thousands of pre-TCGA tumor samples matched to cytological data (Carter et al., 2012). Karyotype models do not affect the calculation of individual solutions, only their ranking. Likelihoods from the SCNAs, SSNVs, and pan-cancer karyotype models are combined to produce rankings.

For increased sensitivity in ambiguous cases, when given a primary disease parameter, ABSOLUTE incorporates karyotype models specific to the tumor type when ranking solutions. The impact is seen, for example, in differentiating ambiguous solutions, one of which implies a genome doubling event. The frequency of genome doublings varies across tumor types and reflects disease- and tissue-specific biology. Genome doublings are rare in hematopoietic neoplasms, e.g. ALL and CLL, and have a higher incidence in other types of cancer, such as oesophageal adenocarcinoma (Barrett et al. 1999).

(3) Allelic fraction. ABSOLUTE utilizes the optional, but recommended, mutation data in two ways. (i) ABSOLUTE infers the purity of a sample from copy number data in conjunction with mutation data. (ii) ABSOLUTE estimates cellular multiplicity, that is, average allelic copies per cancer cell, to potentially reveal subclonal populations as diagrammed in the fourth chart. Putative solutions incorporating mutation data aid in the manual selection of a best solution. What is key for ABSOLUTE is that the mutation data contain only somatic events.

Given the likely divergent instigations of different types of genomic events in cancer, SCNAs alone provide limited resolution in inferring tumor heterogeneity. Sequence mutation information provides ABSOLUTE an alternative point of reference, that is, more incremental information in tumor progression, that then allows a more comprehensive modeling of tumor heterogeneity.

  • High-confidence calling is possible for somatic point mutations, a type of somatic single nucleotide variant (SSNV), with algorithms such as VarScan or MuTect. ABSOLUTE algorithm v1.0.6 expects point mutations; if given other types of mutations, e.g. insertions, it still treats them as point mutations, which is not best practice. A future version of ABSOLUTE will handle insertion and deletion mutations differently from point mutations.
  • Inclusion of germline variants leads to inflated purity estimates as they are present clonally, in both the tumor and the normal.
  • Because discrete allelic fractions are obscured by tumor purity and local copy number, ABSOLUTE uses cellular multiplicity units. 
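
The conversion from allelic fraction to multiplicity can be sketched as follows. This is an illustration rather than ABSOLUTE's implementation: it assumes a somatic mutation absent from the diploid normal cells, so that the expected allelic fraction is f = αs / (αq + 2(1 − α)), where α is purity, q is the local absolute copy number and s is the multiplicity. The function name is hypothetical.

```python
def mutation_multiplicity(alt_count, ref_count, purity, local_copy_number):
    """Estimate mutated allele copies per cancer cell from read counts.

    Assumes the mutation is somatic (absent from normal cells) and that
    normal cells carry two copies of the locus.
    """
    depth = alt_count + ref_count
    if depth == 0:
        raise ValueError("no reads cover this mutation")
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    allelic_fraction = alt_count / depth
    # Average number of copies of the locus per cell in the mixed sample.
    copies_per_cell = purity * local_copy_number + 2.0 * (1.0 - purity)
    return allelic_fraction * copies_per_cell / purity


if __name__ == "__main__":
    # 80% pure sample, two copies at the locus in cancer cells.
    print(mutation_multiplicity(40, 60, 0.8, 2))  # about 1: one copy per cell
    print(mutation_multiplicity(20, 80, 0.8, 2))  # about 0.5: possibly subclonal
```

A multiplicity well below 1 suggests that only a fraction of cancer cells carry the mutation, which is how subclonal populations can show up in the mutation charts.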

The module's default parameters reflect the original analysis aims of balancing over-fitting subclonal copy alterations to derive more complex karyotypes against the applicability of a simpler solution in finding tumor samples with high purity. For example, default parameters discard solutions with greater than 5% subclonal fractions and thus skew presented solutions to those of increased ploidy. Change default parameters for samples expected to have a higher proportion of heterogeneous nuclei, especially those for which mutation data are also provided. 

Algorithm

Equations used in the algorithm are given in the Carter et al. publication.

ABSOLUTE extracts the absolute copy number of local DNA segments per cancer cell from the mixed DNA population in three steps:

  1. Estimates the tumor purity and ploidy from observed relative copy profiles and, if provided, from somatic point mutation data.
  2. Resolves ambiguous cases of purity and ploidy using pre-computed statistical models of recurrent cancer karyotypes based on a large and diverse reference sample collection.
  3. Attempts to account for copy number alterations and point mutations in tumor subclones.

ABSOLUTE expects copy ratios very close to 1.0 and will fail if ratios are less than 0.75 or greater than 1.25. The analysis can also fail when a sample exceeds the max as seg count threshold; an excessive number of segments is associated with noisy or poor-quality data.
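
A pre-flight check along these lines can catch both failure conditions before a job is submitted. This is a hedged sketch: the source does not say which statistic the 0.75 to 1.25 bound applies to, so the code assumes it is the length-weighted mean of linear-scale copy ratios, and the function name and input layout are hypothetical. The segment limit mirrors the max as seg count default of 1500.

```python
def preflight_check(segments, max_seg_count=1500, low=0.75, high=1.25):
    """Return a list of problems found in (length_bp, copy_ratio) segments.

    Assumption: the copy-ratio bound is applied to the length-weighted
    mean of linear-scale copy ratios.
    """
    problems = []
    if len(segments) > max_seg_count:
        problems.append(
            f"{len(segments)} segments exceed the limit of {max_seg_count}"
        )
    total_length = sum(length for length, _ in segments)
    if total_length <= 0:
        problems.append("segments have no total length")
        return problems
    mean_ratio = sum(length * ratio for length, ratio in segments) / total_length
    if not low <= mean_ratio <= high:
        problems.append(
            f"mean copy ratio {mean_ratio:.3f} is outside [{low}, {high}]"
        )
    return problems


if __name__ == "__main__":
    example = [(1_000_000, 1.0), (500_000, 1.3), (250_000, 0.6)]
    print(preflight_check(example) or "no problems found")
```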

References

Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, Beroukhim R, Pellman D, Levine DA, Lander ES, Meyerson M, Getz G. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;30(5):413-21.

Parameters

Name Description
seg dat file * A HAPSEG output file (<plate.name>_<array.name>.segdat.RData) or other segmented copy number data file.  If you supply a tab-delimited segmentation file, see the Input Files section for file details.
output file name base * If specified, provides a base filename for all output files. The default value is the sample name parameter.

  • Note that the downstream module ABSOLUTE.summarize requires each sample name to be unique, not just each output file name. For multiple concurrent input files, it is therefore enough to vary the sample name parameter to obtain unique sample and file names.
sigma p * Provisional value of excess sample level variance used for mode search.  Default: 0
max sigma h * Maximum value of excess sample level variance.  For more details, see equation 6 in the ABSOLUTE paper.  Default: 0.015
min ploidy * Minimum ploidy value, in units of N, for the algorithm to consider. Models implying lower ploidy values are discarded.  Default: 0.95N
max ploidy * Maximum ploidy value, in units of N, for the algorithm to consider. Models implying greater ploidy values are discarded.  Default: 10N
primary disease * Primary disease of the sample, used for tumor-specific karyotype matching. Enter 'NA' to use the pan-cancer karyotype reference. This parameter affects the ranking of solutions, not the solutions themselves. If the input does not match an entry in the following list, ABSOLUTE defaults to the pan-cancer reference:

Acute myelogenous leukemia, Bladder Cancer, BLCA, Brain Cancer, BRCA, Breast, Breast Cancer, Carcinoid, ccRCC, Cervical Cancer, CESC, Chronic Lymphocytic Leukemia, Chronic lymphocytic leukemia, CLL, COAD, Colon Cancer, Colorectal, Dedifferentiated Liposarcoma, Endometrial Cancer, Esophageal adenocarcinoma, Esophageal Cancer, Esophageal squamous, Ewing Sarcoma, Gastric, Gastric Cancer, GBM, GIST, Glioma, Head and Neck Cancer, Hepatocellular Carcinoma, Kidney Cancer, Kidney cancer, KIRC, LAML, Leiomyosarcoma, Liver Cancer, LUAD, Lung, Lung adenocarcinoma, Lung adenosquamous, Lung Cancer, Lung SCLC, Lung squamous, LUSC, Lymphoma, Medulloblastoma, Melanoma, Mesothelioma, MFH, Mulitple Myeloma, Multiple Myeloma, Myelodysplasia, Myeloproliferative Disorder, Myxoid Liposarcoma, Neuroblastoma, NSCLC, Osteosarcoma, OV, Ovarian, Ovarian Cancer, Pancreatic Cancer, Pediatric Acute lymphoblastic leukemia, Pediatric GIST, Pleomorphic Liposarcoma, PRAD, Prostate, Prostate Cancer, READ, Rectum Cancer, Renal Cancer, Rhabdoid, Rhabdoid Tumor, Sarcoma, Stomach Cancer, Synovial sarcoma, Thyroid, Thyroid Cancer

platform * The platform used to generate the data.  Supported platforms are:

  • SNP_250K_STY
  • SNP_6.0 (default)
  • Illumina_WES
sample name * The name of the sample for display and for use in downstream module ABSOLUTE.summarize, which, for multiple concurrent file input, requires unique sample names. 
max as seg count * Maximum number of allelic segments. Samples with a higher segment count will be flagged as 'failed'.  Default: 1500
max neg genome * Maximum allowable fraction of the genome that can be modeled as less than zero without discarding a given solution. Due to noise in the data, ABSOLUTE may sometimes model the fraction of the genome attributed to tumor subclones as less than zero. Default: 0.005

max non clonal * Maximum genome fraction that may be modeled as non-clonal, that is, as being derived from tumor subclones. Solutions implying greater values are discarded.  Default: 0.05

  • Increase this parameter for samples expected to have a higher proportion of heterogeneous nuclei, especially if mutation data is also provided.
copy number type * The copy number type to assess, based on the input data type.

  • allelic (default) for data from HAPSEG or AllelicCapseg 
  • total for all other data 
maf file If available, somatic mutation data in mutation annotation format (MAF) that includes t_ref_count and t_alt_count columns. See Input Files section for more details. If using this parameter, also specify the min mut af parameter described next.
min mut af Mutations with lower allelic fractions than the indicated minimum mutation allelic fraction will be excluded from analysis. Zero is an accepted value. Note that if maf file is specified, min mut af must also be specified.

* - required
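
The effect of min mut af can be pictured with a short sketch. It assumes the allelic fraction is computed as t_alt_count / (t_ref_count + t_alt_count), consistent with the column descriptions in the Input Files section; the function name and the dictionary layout are illustrative and not part of the module.

```python
def filter_by_min_mut_af(mutations, min_mut_af):
    """Keep mutations whose allelic fraction is at least min_mut_af.

    Each mutation is a dict with integer 't_ref_count' and 't_alt_count'.
    Mutations without any covering reads are dropped in this sketch.
    """
    kept = []
    for mutation in mutations:
        depth = mutation["t_ref_count"] + mutation["t_alt_count"]
        if depth == 0:
            continue
        allelic_fraction = mutation["t_alt_count"] / depth
        if allelic_fraction >= min_mut_af:
            kept.append(mutation)
    return kept


if __name__ == "__main__":
    calls = [
        {"Hugo_Symbol": "geneA", "t_ref_count": 95, "t_alt_count": 5},
        {"Hugo_Symbol": "geneB", "t_ref_count": 80, "t_alt_count": 20},
    ]
    for call in filter_by_min_mut_af(calls, 0.1):
        print(call["Hugo_Symbol"])
```

Setting the threshold to zero, which the module accepts, keeps every mutation that has read coverage.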

Input Files 

Each file represents one sample. Files containing multiple samples are not accepted. ABSOLUTE algorithm v1.0.6 filters out features mapping to any chromosome labeled “X”.

  1. Segmented copy ratios data file in either of the following two formats:
    • For ALLELIC copy number type analysis, supply an RData file produced by HAPSEG or AllelicCapseg. These datasets allow incorporation of copy neutral LOH events. Segmentation data produced by any other means must conform to the output formats of HAPSEG/AllelicCapseg for ABSOLUTE to consider copy neutral LOH events. 

    • For TOTAL copy number type analysis, supply a tab-delimited segmentation file in plain-text format. File extension does not matter. ABSOLUTE algorithm v1.0.6 requires the following five columns. Additional columns are ignored.

      • Chromosome
        • In either chr# or # format. 
      • Start
      • End
      • Num_Probes
      • Segment_Mean
  2. (Optional) Somatic mutation data in mutation annotation format (MAF), as a plain-text file. File extension does not matter, and header rows beginning with # may be present. ABSOLUTE algorithm v1.0.6 requires the following seven columns. Additional columns are ignored.
  • t_ref_count OR i_t_ref_count 
    • Count of reference alleles in tumor.
  • t_alt_count OR i_t_alt_count 
    • Count of alternate alleles in tumor. Together with t_ref_count adds up to the depth of reads in the tumor BAM alignment. You can calculate a missing value if two of these three values are known or with read depth and the frequency of the alternate allele within the sample. These and other MuTect output columns are described further in the GATK forum.
  • dbSNP_Val_Status
    • Fields may be blank. Multiple values are separated by a semicolon without spaces. Example values include bySubmitter, by1000genomes, by2Hit2Allele, and byHapMap.
  • Start_position 
    • Note the lowercase "p". Also, note that the End_position column is not required. This implies that ABSOLUTE algorithm v1.0.6 treats all mutation data equally as point mutations, the expected type of mutation data.
  • Tumor_Sample_Barcode
    • Fields may be blank.
  • Hugo_Symbol
    • Fields may be blank or "unknown".
  • Chromosome
    • Must be in # format, not chr# format. The # value must be identical to that in the segmented copy ratios data file. For example, ABSOLUTE does not equate X with 23 and will exclude such mutations as unmapped. Note that ABSOLUTE algorithm v1.0.6 excludes data on chromosomes labeled X but not data on numbered chromosomes, e.g. 23.
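
The column requirements for both input files can be checked before submission with a sketch like the one below. The helper names are hypothetical; the column names are taken from the lists above, and # header rows are skipped as described for MAF files.

```python
import io

# Each entry lists the accepted alternatives for one required column.
SEG_COLUMNS = [
    ("Chromosome",), ("Start",), ("End",), ("Num_Probes",), ("Segment_Mean",),
]
MAF_COLUMNS = [
    ("t_ref_count", "i_t_ref_count"),
    ("t_alt_count", "i_t_alt_count"),
    ("dbSNP_Val_Status",),
    ("Start_position",),
    ("Tumor_Sample_Barcode",),
    ("Hugo_Symbol",),
    ("Chromosome",),
]


def read_header(handle):
    """Return column names from the first line that is not blank or a # row."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        return line.rstrip("\r\n").split("\t")
    return []


def missing_columns(header, required):
    """List required columns (with alternatives) absent from the header."""
    present = set(header)
    return [
        " OR ".join(alternatives)
        for alternatives in required
        if not any(name in present for name in alternatives)
    ]


if __name__ == "__main__":
    # In practice, pass an open file instead of this in-memory example.
    maf = io.StringIO("#version 2.4\nHugo_Symbol\tChromosome\tStart_position\n")
    print(missing_columns(read_header(maf), MAF_COLUMNS))
```

Column names are compared case-sensitively here, which matches the note about the lowercase "p" in Start_position.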

Output Files

  1. <output.file.name.base>_plot.pdf
  • Three to four types of plots showing a number of modeled solutions. The fourth plot type is given if mutation data is provided. Each modeled solution is represented by a color across the plot types and presented in the order of combined likelihood. Please refer to the Analyzing ABSOLUTE Data page for detailed descriptions of the plots. These plots are (1) purity and ploidy solutions, (2) likelihoods of each of the solutions based on SCNAs, karyotype, SSNVs (if given mutation data), and combined, (3) genomic fraction versus copy ratio on an absolute scale for each proposed solution, and (4) if given mutation data, SSNV allelic fraction, SSNV multiplicity, and cancer cell fraction (CCF) charts for each solution.
  • The order of the presented solutions represents the ranking. Review these solutions and count the number rank of what you consider the best solution. You will use this number when you modify the calls file to override the top ranked solution in finalized results.
  2. <output.file.name.base>.RData
  • An R file containing an object ‘seg.dat’ which provides all of the information used to generate the plots. This file serves as the input to ABSOLUTE.summarize.

Whether or not you get an error message, and especially if a PDF is not produced, examine the stdout.txt and stderr.txt files from your jobs. They give clues to what may have caused an error and show which portions of the data were excluded from the analysis by the filtering mechanisms in place.

  • For example, stdout.txt tells you how many mutations were unmapped, that is, had no corresponding segment in the segmentation file and thus were excluded. Segmentation data may exclude chromosome end regions for which data were too noisy to obtain copy ratios.
  • If a PDF plot is not produced alongside the RData file, stdout.txt may show that all the solutions, that is, modes, were removed based on parameter settings.

Example Data

The dataset samples are from a single human bladder cancer patient. The following parameters were used to obtain results: max sigma h = 0.2, min ploidy = 0.5, max ploidy = 8, primary disease = BLCA, platform = SNP_6.0, varied sample names, max non clonal = 1, and copy number type = total. For the two samples with somatic mutation data, min mut af = 0.1.

The results were then passed through ABSOLUTE.summarize, manually reviewed and augmented to select for alternative solutions and finalized through ABSOLUTE.review. Download these example results and the example override file using the following links:

Requirements

Acceptance of the module license is required for its use. A copy of the license text is available at http://www.broadinstitute.org/cancer/cga/sites/default/files/images/ABSOLUTE_HAPSEG_license_2013.pdf

The ABSOLUTE module runs only on GenePattern 3.4.2 or above and requires R2.15 with the following packages, each of which will automatically download and install when the module is installed:
  • numDeriv_2012.9-1
  • getopt_1.17
  • optparse_0.9.5

Please install R2.15.3 rather than R2.15.2 before installing the module. The GenePattern team recommends R2.15.3, which fixes significant bugs in R2.15.2. The team has confirmed test data reproducibility for this module using R2.15.3 compared to R2.15.2 and can provide only limited support for other versions. R2.15.3 must be installed and configured independently, as discussed in Using Different Versions of R and Using the R Installer Plug-in. These sections also describe the patch-level fixes needed when additional installations of R are made, and considerations for those who use R outside of GenePattern.

Platform Dependencies

Task Type:
SNP Analysis

CPU Type:
any

Operating System:
any

Language:
R2.15

Version Comments

Version Release Date Description
1.5 2015-10-13 Updated to make use of the R package installer.
1 2013-06-30 Initial version.