HAPSEG (v1)

A probabilistic method to interpret bi-allelic marker data in cancer samples.

Authors: Scott Carter, Matthew Meyerson, Gad Getz

Contact:

absolute-help@broadinstitute.org, http://www.broadinstitute.org/cancer/cga/cga_forums, gp-help@broadinstitute.org

Algorithm Version: HAPSEG 1.1.1

Summary

The HAPSEG module takes single nucleotide polymorphism (SNP) microarray data and outputs copy number data segmented by haplotype.  The output data is suitable for use as input data for the ABSOLUTE module.

Introduction

The human genome typically consists of a set of chromosome pairs, with one chromosome in each pair, known as a homolog, derived from each parent.  For a given gene on a given chromosome, there is a comparable, if not identical, copy of that gene on the homologous chromosome; each version of the gene is known as an allele.

Cancer cells frequently have large structural alterations in their chromosomes.  Genes within these structurally altered regions may be simply rearranged, or may be duplicated or deleted.  Thus, instead of having a homologous pair of alleles for a given gene, there may be fewer or more copies than normal for that gene.  For a gene (also referred to as a marker or locus) whose two alleles differ, that is, a heterozygous locus, this can lead to unequal contribution of one allele over the other, altering the copy number of a given allele, as shown in this simplified figure:

This unequal contribution of alleles can be used to directly determine (or phase) the haplotypes of the homologous chromosomes.  A haplotype is a group of loci/markers/genes or clusters of single nucleotide polymorphisms (SNPs) that were inherited together from a single parent because of genetic linkage -- the phenomenon by which genes that are close to each other on the same chromosome are often inherited together.  Haplotype phasing can be inferred by associating different markers with the way they are duplicated or deleted together.  As a very simple example:

HAPSEG uses the chromosomal rearrangements in cancer cells to infer the most likely haplotypes for a given set of markers (usually SNPs on a microarray).  The output from HAPSEG can be used as input for the ABSOLUTE module.  
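
To make the phasing idea concrete, here is a minimal Python sketch. It is not HAPSEG's actual model, which is probabilistic. It assumes hypothetical allele-specific signals at heterozygous SNPs inside one segment where one homolog has been gained, and simply assigns the higher-signal allele at each SNP to the gained homolog. The function name, SNP names, and numbers are illustrative assumptions.

```python
def phase_segment(allele_signals):
    """Naively phase heterozygous SNPs within a single copy-number segment.

    allele_signals maps SNP id -> (signal_A, signal_B). In a segment where
    one homolog is gained, the allele with the larger signal is assumed to
    sit on the gained homolog. Noise is ignored here; HAPSEG models it
    probabilistically.
    """
    gained, other = {}, {}
    for snp, (a, b) in allele_signals.items():
        if a >= b:
            gained[snp], other[snp] = "A", "B"
        else:
            gained[snp], other[snp] = "B", "A"
    return gained, other


# Hypothetical signals for three heterozygous SNPs in a gained segment.
signals = {"snp1": (2.1, 0.9), "snp2": (1.0, 1.9), "snp3": (2.0, 1.1)}
gained_hap, other_hap = phase_segment(signals)
print(gained_hap)  # {'snp1': 'A', 'snp2': 'B', 'snp3': 'A'}
```

Because the three SNPs change copy number together, their over-represented alleles (A, B, A) are inferred to lie on the same homolog.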

Algorithm

Estimation of the precise contribution of each homolog in a DNA sample obtained from cancer tissue is crucial to understanding the genetic alterations occurring in the cancer cells.  HAPSEG:

  1. Uses data from a defined set of SNPs on a microarray and summarizes information from hundreds of thousands of data points into hundreds of chromosomal segments.
  2. Calculates the copy ratio, which is:

    The copy ratio of a given allele depends on the germline (normal diploid) genotype and on the concentration of the homolog where the allele resides in the cancer-derived DNA sample.
  3. Divides the genome into regions of equal copy number -- that is, each segment is a section of a chromosome where all the loci have the same number of copies.
  4. Models four distinct genotypes in each segment using the statistical program BEAGLE (included in HAPSEG; for more information on BEAGLE see http://faculty.washington.edu/browning/beagle/beagle.html), which allows it to attribute contiguous chromosomal blocks of variation to one or the other homolog, identifying the contribution of a single parent (phasing the haplotypes).
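
Step 3 can be illustrated with a deliberately simple Python sketch. It is not the segmentation algorithm HAPSEG uses; it only shows what "regions of equal copy number" means by splitting an ordered list of hypothetical per-marker copy ratios wherever a value departs from the running mean of the current segment by more than a tolerance. The function name, the tolerance, and the data are assumptions.

```python
def segment_by_level(ratios, tol=0.25):
    """Split ordered copy ratios into runs of roughly equal level.

    Returns a list of (start_index, end_index, mean_ratio) tuples with
    inclusive indices. A new segment starts when a value differs from the
    running mean of the current segment by more than tol.
    """
    segments = []
    start, total = 0, 0.0
    for i, r in enumerate(ratios):
        count = i - start  # markers already in the current segment
        if count > 0 and abs(r - total / count) > tol:
            segments.append((start, i - 1, total / count))
            start, total = i, 0.0
        total += r
    if ratios:
        segments.append((start, len(ratios) - 1, total / (len(ratios) - start)))
    return segments


# Hypothetical copy ratios for nine consecutive markers.
ratios = [1.0, 1.1, 0.9, 2.0, 2.1, 1.9, 2.0, 0.5, 0.5]
for seg in segment_by_level(ratios):
    print(seg)
```

With these made-up values the markers fall into three segments (indices 0-2, 3-6, and 7-8), which is how hundreds of thousands of data points can be summarized as a much smaller number of segments.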

The following figure shows data from a SNP microarray after segmentation with HAPSEG.  The SNP data are presented as allelic copy ratio (vertical axis) over the human genetic sequence, as separated into chromosomes (horizontal axis).  HAPSEG determines where distinct copy ratios occur along different chromosomes and marks those segments with green lines.

Being able to directly determine haplotypes from cancer samples presents an opportunity to expand reference panels of phased chromosomes, which may be of general interest in population genetic applications.  (A reference panel is a collection of samples that are not of direct interest in the experiment, but that are included in an analysis to increase statistical power or accuracy for the samples of interest.)  In addition, this ability can be used to interrogate the relationship between germline risk and cancer phenotype with greater sensitivity than is possible with unphased genotypes.

Note: HAPSEG may require several hours to run per sample.

References

Carter SL, Meyerson M, Getz G. Accurate estimation of homologue-specific DNA concentration-ratios in cancer samples allows long-range haplotyping. Available from Nature Precedings; 2011.

BEAGLE: http://faculty.washington.edu/browning/beagle/beagle.html

Parameters

Name: Description
plate name *: Name of the sample plate.  This is used for display and reporting purposes only.
array name *: Name of the chip that was run.  This is used for display and reporting purposes only.
seg file: Segmented copy number data file for this sample (e.g., from GLAD, CBS, or similar algorithms).  If this file is not provided, HAPSEG will segment the data for you.
snp file *: SNP intensity file for this sample.
out file name: The name of the output file.  By default, this will be <plate.name>_<array.name>.segdat.RData
genome build *: Which build of the human genome to use.  The supported values are currently:

  • hg18
  • hg19 (default)

platform *: The microarray chip type used.  The supported values are currently:

  • SNP_250K_STY
  • SNP_6.0 (default)

use pop *: HAPMAP population to use.  The currently supported values are:

  • CEPH (default): Utah residents with ancestry from northern and western Europe
  • CH: Han Chinese in Beijing, China
  • JA: Japanese in Tokyo, Japan
  • YOR: Yoruba in Ibadan, Nigeria

impute gt *: If set to TRUE, the module will impute genotypes using BEAGLE (included in the HAPSEG module).  The authors recommend this be TRUE.
plot segfit *: If set to TRUE, the module will plot JPG images of the segmentation fits.
merge small *: If set to TRUE, the module will merge small segments.  The algorithm for merging segments can be found in the HAPSEG paper.
merge close *: If set to TRUE, the module will merge close segments.  The algorithm for merging segments can be found in the HAPSEG paper.
min seg size *: Minimum segment size.  Default: 10
normal *: If set to TRUE, the module will treat this sample as a normal sample.  The default is FALSE.
out p *: Outlier probability.  Default: 0.05
seg merge thresh *: The distance threshold for merging segments.  Default: 1e-10
use normal *: If set to TRUE, the module will use a matched normal sample if one is provided.  The default is FALSE.
drop x *: If set to TRUE, the module will remove the X chromosome from the calculation.  The default is FALSE.
drop y *: If set to TRUE, the module will remove the Y chromosome from the calculation.  The default is TRUE.
calls file: If you are using a matched normal sample, a Birdseed SNP calls file must be supplied.  Birdseed is a SNP genotyping algorithm, and it outputs a file containing Birdseed genotype calls of 0 (AA), 1 (AB), or 2 (BB).
mn sample: If using a matched normal sample (use normal is set to TRUE), the name of that matched normal sample.
calibrate data *: Calibration is the process by which SNP measurements are standardized to copy ratios.  If On, the module will perform a calibration on the input data.  If Off, no calibration will be performed.  If left at the default value (Inferred), the calibration status will be inferred.
clusters file: If calibrate data is On, the user must supply a Birdseed clusters file.  Birdseed is a SNP genotyping algorithm, and it outputs a file containing the estimates of means and variances of intensities for each SNP for AA samples, AB (heterozygous) samples, and BB samples.
prev theta file: An optional file storing the previous theta values.  Theta values represent the allelic intensity ratios for SNPs on the array.  Equal heterozygotes have a ratio of 0.5, while homozygous calls give values of ~0.8 and ~0.2.

* - required
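
The defaults and constraints listed above can be summarized in a short Python sketch. It is not part of the module. The dictionary keys follow the dotted style that this document uses for some names (for example <plate.name> and <snp.file>); the remaining keys are formed by analogy and are assumptions, as are the function names. Only defaults that the table states are included.

```python
# Defaults stated in the parameter table (key spellings are assumptions).
DEFAULTS = {
    "genome.build": "hg19",
    "platform": "SNP_6.0",
    "use.pop": "CEPH",
    "min.seg.size": 10,
    "normal": False,
    "out.p": 0.05,
    "seg.merge.thresh": 1e-10,
    "use.normal": False,
    "drop.x": False,
    "drop.y": True,
    "calibrate.data": "Inferred",
}

# Supported values listed in the parameter table.
ALLOWED = {
    "genome.build": {"hg18", "hg19"},
    "platform": {"SNP_250K_STY", "SNP_6.0"},
    "use.pop": {"CEPH", "CH", "JA", "YOR"},
    "calibrate.data": {"On", "Off", "Inferred"},
}


def default_out_file(plate_name, array_name):
    """Default output name: <plate.name>_<array.name>.segdat.RData."""
    return f"{plate_name}_{array_name}.segdat.RData"


def check_params(user_params):
    """Merge user values over the defaults; return (params, problems)."""
    params = {**DEFAULTS, **user_params}
    problems = []
    for name in ("plate.name", "array.name", "snp.file"):
        if not params.get(name):
            problems.append(f"missing required parameter: {name}")
    for name, allowed in ALLOWED.items():
        if params[name] not in allowed:
            problems.append(f"unsupported value for {name}: {params[name]}")
    if params["use.normal"] and not params.get("calls.file"):
        problems.append("use.normal is TRUE but no calls.file was given")
    if params["calibrate.data"] == "On" and not params.get("clusters.file"):
        problems.append("calibrate.data is On but no clusters.file was given")
    return params, problems


# Hypothetical run: calibration requested without a clusters file.
params, problems = check_params(
    {"plate.name": "plateA", "array.name": "array1",
     "snp.file": "snp.txt", "calibrate.data": "On"}
)
print(problems)
print(default_out_file(params["plate.name"], params["array.name"]))
```

The two conditional checks mirror the table: a calls file is needed when a matched normal is used, and a clusters file is needed when calibrate data is On.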

Input Files

  1. <snp.file>

A SNP intensity file containing this sample, which can either be per-sample (default) or multi-sample.  This is a tab-delimited file with two columns named A and B, in which the row names correspond to the chip's probeset IDs.  This file can be either a text file (created in the R programming language via write.table or the equivalent) or a saved RData file containing that data as the object dat.

In a multi-sample SNP file, the probeset IDs in column A will be repeated for each sample and are distinguished by having "<array name>-" prepended to each.  HAPSEG uses the <array name> parameter to decide which sample to load on that run; it accepts a multi-sample file but operates only on the chosen sample.
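
As an illustration of the layout described above, the following Python sketch reads a tab-delimited SNP intensity text file and, for a multi-sample file, keeps only the rows whose probeset ID starts with "<array name>-". It assumes that the header line ends with the column names A and B and that each data row starts with the probeset ID followed by the two intensities. The function name, probeset IDs, and values are made up for the example.

```python
import csv
import io


def read_snp_intensities(handle, array_name=None):
    """Read probeset intensities into {probeset_id: (A, B)}.

    If array_name is given, the file is treated as multi-sample: only rows
    whose ID starts with "<array_name>-" are kept, and the prefix is removed.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[-2:] != ["A", "B"]:
        raise ValueError("expected the last two header columns to be A and B")
    prefix = f"{array_name}-" if array_name else ""
    data = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        probeset, a, b = row[0], float(row[1]), float(row[2])
        if prefix:
            if not probeset.startswith(prefix):
                continue  # row belongs to another sample
            probeset = probeset[len(prefix):]
        data[probeset] = (a, b)
    return data


# Made-up two-sample file; real probeset IDs and intensities will differ.
text = (
    "A\tB\n"
    "array1-SNP_1\t1200.5\t340.0\n"
    "array1-SNP_2\t410.0\t980.2\n"
    "array2-SNP_1\t1100.0\t300.0\n"
)
print(read_snp_intensities(io.StringIO(text), array_name="array1"))
# {'SNP_1': (1200.5, 340.0), 'SNP_2': (410.0, 980.2)}
```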

  2. <seg.file>

A segmented copy number file (e.g., from GLAD, CBS, etc).

  3. <clusters.file>

A Birdseed clusters file, either processed by the Affymetrix SNP6 Copy Number Inference Pipeline or raw from Birdseed.  This is a tab-delimited file where row names are the probeset IDs. In this case there are 6 columns: AA.a, AB.a, BB.a, BB.b, AB.b and AA.b.

  4. <prev.theta.file>

A file storing theta values from previous HAPSEG runs.

Output Files

  1. <plate.name>_<array.name>.segdat.RData

The copy number data segmented by haplotype.  This is suitable as an input to the ABSOLUTE GenePattern module.

  2. chr*/HAPSEG_SEG*.jpg

Per-chromosome plots of the fitted segments.  There will be one subdirectory for each chromosome and one plot for each fitted segment.  These are only provided if <plot.segfit> is TRUE.  Note that these files will not be created on Windows.
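
For readers who want to inspect the plots programmatically, here is a small Python sketch that groups the plot files by chromosome subdirectory using the chr*/HAPSEG_SEG*.jpg pattern given above. The function name and the choice of the current directory as the output location are assumptions.

```python
from collections import defaultdict
from pathlib import Path


def collect_segfit_plots(output_dir):
    """Group HAPSEG_SEG*.jpg files by their chr* subdirectory name."""
    plots = defaultdict(list)
    for path in sorted(Path(output_dir).glob("chr*/HAPSEG_SEG*.jpg")):
        plots[path.parent.name].append(path.name)
    return dict(plots)


# Assumes the job output was downloaded into the current directory.
for chrom, files in collect_segfit_plots(".").items():
    print(chrom, len(files))
```

An empty result is expected when plot segfit was FALSE or when the module ran on Windows.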

Example Data

A set of example data from the CGA group is available at: ftp://ftp.broadinstitute.org/pub/genepattern/example_files/HAPSEG_1.1.1/paper_example.zip

Note that there is a README file in the ZIP archive that provides the filenames and parameters you will need to run this example data through HAPSEG, ABSOLUTE, ABSOLUTE.summarize, and ABSOLUTE.review.

Requirements

Acceptance of the module license is required for its use.  A copy of the license text is available here: http://www.broadinstitute.org/cancer/cga/sites/default/files/images/ABSOLUTE_HAPSEG_license_2013.pdf

There is a known issue that may occur when running HAPSEG in GenePattern on a Mac with multiple versions of R installed: if R 2.15.2 is not set as the current version of R, then HAPSEG may fail with fontconfig errors.  These errors should not occur if R has been configured according to the GenePattern Administrative Guide sections on Different Versions of R and Using the R Installer Plug-in.  You may be able to fix this problem by using the RSwitch utility to change your current R to 2.15.2.  Alternatively, setting 'plot segfit' to FALSE might also prevent the error.

There is also a known issue with running HAPSEG on Windows, wherein the jpg files are not output. The .segdat.RData file produced, however, is valid.

The HAPSEG module runs only on GenePattern 3.4.2 or above and requires R 2.15.2 with the following packages:

  • boot_1.3-7
  • class_7.3-5
  • cluster_1.14.3
  • foreign_0.8-51
  • KernSmooth_2.23-8
  • lattice_0.20-10
  • MASS_7.3-22
  • Matrix_1.0-9
  • mgcv_1.7-21
  • nlme_3.1-105
  • nnet_7.3-5
  • rpart_3.1-55
  • spatial_7.3-5
  • Rcpp_0.9.14
  • numDeriv_2012.9-1
  • iterators_1.0.6
  • foreach_1.4.0
  • BiocGenerics_0.4.0
  • DBI_0.2-5
  • xtable_1.7-0
  • XML_3.95-0.1
  • RSQLite_0.11.2
  • IRanges_1.16.2
  • Biobase_2.18.0
  • AnnotationDbi_1.20.1
  • annotate_1.36.0
  • RColorBrewer_1.0-5
  • geneplotter_1.36.0
  • getopt_1.17
  • optparse_0.9.5
  • DNAcopy_1.32.0

Each of these R packages will be automatically downloaded and installed when the module is installed.  This process will take some time due to the size and number of these packages.  R 2.15.2 must be installed and configured independently.

Note that HAPSEG may require several hours to run per sample.  While it is not strictly required, a computational grid or dedicated multi-core server is highly recommended.  The computation generally requires at least 6 GB of available RAM.

HAPSEG assumes the presence of a <shared.data.home> custom server property pointing to a directory for shared data.  That directory should hold phased BEAGLE files for the hg18 and hg19 genomes in a hierarchical structure of:

<shared.data.home>/BEAGLE/phasedBGL/hg18/files and <shared.data.home>/BEAGLE/phasedBGL/hg19/files

The phased BEAGLE files can be obtained from: ftp://ftp.broadinstitute.org/pub/genepattern/example_files/HAPSEG_1.1.1/phasedBGL.zip
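
Before running the module, an administrator may want to confirm that the shared data directory has the layout described above. The following Python sketch is one way to do that; the function name and the example path are hypothetical, and it checks only that the per-build directories exist, not their contents.

```python
from pathlib import Path


def missing_beagle_dirs(shared_data_home, builds=("hg18", "hg19")):
    """Return the expected phased-BEAGLE directories that do not exist."""
    base = Path(shared_data_home) / "BEAGLE" / "phasedBGL"
    return [str(base / build) for build in builds if not (base / build).is_dir()]


# Hypothetical location; use the value of the <shared.data.home> property.
print(missing_beagle_dirs("/path/to/shared_data"))
```

An empty list means both the hg18 and hg19 directories are present.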

Platform Dependencies

Task Type:
SNP Analysis

CPU Type:
any

Operating System:
any

Language:
R2.15.2

Version Comments

Version  Release Date  Description
1        2013-06-30    Initial version.