Cancer Genome Analysis

The Cancer Genome Analysis (CGA) group in the Cancer Program of the Broad Institute of Harvard and MIT is a team of computational biologists, software engineers and research scientists with diverse backgrounds. The group aims to:

(i) Develop, build, maintain and publish tools for analyzing cancer genome data (next-generation sequencing and array-based data);
(ii) Develop, build and maintain "Firehose", a platform for running research and production analyses on large amounts of data;
(iii) Operate production-grade pipelines for analyzing cancer genome project data (TCGA and similar projects); and
(iv) Provide the analysis champions for cancer genome project teams (each of which includes a tumor-type champion, an analysis champion and a project manager), which drive projects from initiation to publication.

The group is central to the Cancer Program's cancer genome efforts, which include TCGA and similar projects. These projects are funded by various grants and awards at the Broad, including the NCI/NHGRI TCGA Genome Sequencing Center (GSC), the TCGA Genome Characterization Center (GCC) and the TCGA Genome Data Analysis Center (GDAC), as well as other sources such as the Carlos Slim Institute of Health.

Cancer Genome Analysis Tools

The group is developing tools and pipelines which address two major tasks:
(1) Characterization – Fully describing the genomic events (including somatic and germline events, at DNA, RNA and proteomic levels) in tumor and normal samples from a single individual (patient).
(2) Interpretation – Analyzing the characterization data across a set (population or cohort) of individuals with the aim of identifying genes, regions and pathways that are altered beyond what is expected by chance, and identifying subtypes of the disease.

  • Characterization tools
    • ABSOLUTE – estimate purity and ploidy, and derive absolute copy-number and mutation data
    • BreakPointer – pinpoint rearrangement breakpoints in sequencing data
    • ContEst – estimate contamination level in sequencing data
    • dRanger – identify rearrangements in sequencing data
    • HAPSEG – detect genomic segments of constant allele-specific copy-number based on SNP6.0 array data
    • Indelocator – identify short insertions and deletions in sequencing data
    • MuTect – identify point mutations in sequencing data
    • PathSeq – detect pathogenic sequences in sequencing data
    • RNA-SeQC – calculate QC metrics for RNA-seq data
    • SegSeq – detect copy-number changes from sequencing data
  • Interpretation tools
    • GISTIC – detect regions of significant copy-number gains and losses
    • MutSig – detect significantly mutated genes
    • Oncotator – annotate human genomic point mutations and indels
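
To make the Interpretation task's phrase "altered beyond what is expected by chance" concrete, here is a minimal Python sketch of a cohort-level significance test. It is a toy illustration only and is not how MutSig or GISTIC work: it assumes a single uniform background mutation rate per base, and the function names and all numbers (`binomial_tail`, `gene_p_value`, the cohort size, gene length and rate) are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def gene_p_value(mutated_samples, cohort_size, gene_length_bp, background_rate_per_bp):
    """Chance of seeing at least `mutated_samples` mutated patients in a gene.

    Assumption (toy model): every base mutates independently at the same
    background rate, so a patient carries at least one mutation in the gene
    with probability 1 - (1 - rate) ** length.
    """
    p_hit = 1 - (1 - background_rate_per_bp) ** gene_length_bp
    return binomial_tail(mutated_samples, cohort_size, p_hit)


# Hypothetical cohort: 100 patients, a 1,500 bp gene, 1e-6 mutations per base.
p = gene_p_value(mutated_samples=5, cohort_size=100,
                 gene_length_bp=1500, background_rate_per_bp=1e-6)
print(f"p-value under the toy background model: {p:.3g}")
```

A small p-value here means that the gene is mutated in more patients than the assumed background rate would explain. Applying such a test across many genes would also require a multiple-testing correction, which this sketch omits.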