Collection Details

H hallmark collection details

Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

This collection is an initial release of 50 hallmarks which condense information from over 4,000 original overlapping gene sets from v4.0 MSigDB collections C1 through C6. We refer to the original gene sets as "founder" sets.

Hallmark gene set pages provide links to the corresponding founder sets for more in-depth exploration. They also link to the microarray data that were used to refine and validate the hallmark signatures.

To cite your use of the collection, and for further information, please refer to Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 2015 Dec 23;1(6):417-425.

C2 collection details

Gene sets in this collection come from such sources as:

  • Online pathway databases: Gene sets representing metabolic and signaling pathways are imported from the online pathway databases listed under "C2: CP collection details" below.
  • Biomedical literature: Over the past few years, microarray studies have identified signatures of several important biological and clinical states (e.g. cancer metastasis, stem cell characteristics, drug resistance). This collection makes many of these signatures, originally published as tables in a paper, available as gene sets. To do this, we compiled a list of microarray articles with published gene expression signatures and, from each article, extracted one or more gene sets from tables in the main text or supplementary information. Currently, this collection includes gene sets from more than 340 PubMed articles. We are working to create a more automated method of curating gene sets from the literature.
  • L2L: Gene sets compiled from published mammalian microarray studies (Newman and Weiner, Genome Biology 2005, 6(9):R81).
  • MYC Target Gene Database: Gene sets curated by Dr. Chi Dang from the MYC Target Gene Database at Johns Hopkins University School of Medicine.

C2: CP collection details

The pathway gene sets are curated from the following online databases:

Name                            URL/Reference
BioCarta                        http://www.genecarta.com
KEGG                            http://www.genome.jp/kegg
Matrisome                       http://matrisomeproject.mit.edu
Pathway Interaction Database    http://pid.nci.nih.gov
Reactome                        http://www.reactome.org
SigmaAldrich                    http://www.sigmaaldrich.com/life-science.html
Signaling Gateway               http://www.signaling-gateway.org
Signal Transduction KE          http://stke.sciencemag.org
SuperArray                      http://www.superarray.com

C4: CGN collection details

Starting with a curated list of 380 cancer-associated genes (Brentani, Caballero et al. 2003, Proc. Natl. Acad. Sci. USA 100, 13418-13423), Subramanian, Tamayo et al. (2005, Proc. Natl. Acad. Sci. USA 102, 15545-15550) mined four expression compendia datasets for correlated gene sets. Gene neighborhoods with <25 genes at a Pearson correlation threshold of 0.8 were omitted, yielding 427 sets. The four compendia were:

  • Human tissue compendium (Novartis): Gene expression profiles from the Novartis normal tissue compendium, as published in Su, A. I., Wiltshire, T., Batalov, S., Lapp, H., Ching, K. A., Block, D., Zhang, J., Soden, R., Hayakawa, M., Kreiman, G., et al. (2004) Proc. Natl. Acad. Sci. USA 101, 6062-6067.
  • Global Cancer Map (Broad Institute): Gene expression profiles from the global cancer map, as published in Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C. H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J. P., et al. (2001) Proc. Natl. Acad. Sci. USA 98, 15149-15154.
  • NCI-60 cell lines (National Cancer Institute): Gene expression profiles from the NCI 60 data set downloaded from the Developmental Therapeutics Program web site (http://dtp.nci.nih.gov/mtargets/download.html). No preprocessing was done other than collapsing probe IDs to gene symbols.
  • Novartis carcinoma compendium (Novartis): Gene expression profiles from the Novartis carcinoma compendium, as published in Su, A. I., Welsh, J. B., Sapinoso, L. M., Kern, S. G., Dimitrov, P., Lapp, H., Schultz, P. G., Powell, S. M., Moskaluk, C. A., Frierson, H. F., Jr., et al. (2001) Cancer Res. 61, 7388-7393.
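
The neighborhood construction described above can be illustrated with a minimal sketch. It assumes that a neighborhood is the set of genes whose expression profile has a Pearson correlation with the seed gene at or above the threshold, and that neighborhoods below the minimum size are dropped. The function names, the dictionary layout of the expression data, and the reading of the threshold as "r >= cutoff" are assumptions made for illustration, not details taken from the cited paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant profile: treated as uncorrelated (assumption)
    return cov / sqrt(vx * vy)


def gene_neighborhoods(expression, seeds, r_cutoff=0.8, min_size=25):
    """Map each seed gene to the genes correlated with it.

    expression: dict of gene symbol -> list of expression values (hypothetical layout).
    Neighborhoods with fewer than min_size genes are omitted.
    """
    result = {}
    for seed in seeds:
        if seed not in expression:
            continue
        profile = expression[seed]
        # The seed normally lands in its own neighborhood because r = 1 with itself.
        members = {gene for gene, values in expression.items()
                   if pearson(profile, values) >= r_cutoff}
        if len(members) >= min_size:
            result[seed] = members
    return result
```

Calling `gene_neighborhoods(expression, seeds)` once per compendium would give one candidate set per seed gene and dataset; how the published sets were named and pooled across datasets is not covered by this sketch.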

C5 collection details

Gene sets in this collection are derived from the controlled vocabulary of the Gene Ontology (GO) project (www.geneontology.org). The gene sets are based on GO terms (go-basic.obo, downloaded on 3 May, 2016) and their associations to human genes (gene2go, downloaded from NCBI FTP server on 3 May, 2016).
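
As a rough sketch of how term-to-gene associations of this kind could be grouped into gene sets, the snippet below reads gene2go-style rows and collects Entrez Gene IDs per GO term. It assumes the tab-separated NCBI gene2go column order (tax_id, GeneID, GO_ID, Evidence, Qualifier, GO_term, PubMed, Category) and the human taxonomy ID 9606. The function name is hypothetical, and the sketch does not propagate annotations to parent terms using go-basic.obo, which a full pipeline may well do.

```python
from collections import defaultdict

HUMAN_TAX_ID = "9606"  # NCBI taxonomy ID for Homo sapiens


def go_sets_from_gene2go(lines, tax_id=HUMAN_TAX_ID):
    """Group Entrez Gene IDs by GO term ID.

    lines: iterable of tab-separated gene2go-style text lines, for example an
    open file handle. Only the first three columns are used.
    """
    sets = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed rows
        row_tax_id, gene_id, go_id = fields[0], fields[1], fields[2]
        if row_tax_id == tax_id:
            sets[go_id].add(gene_id)
    return dict(sets)
```

With a local copy of the file, usage might look like `with open("gene2go") as handle: go_sets = go_sets_from_gene2go(handle)`; the file path is an assumption.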

Each GO term belongs to one of three ontologies: molecular function (MF), cellular component (CC), or biological process (BP). A gene product might be associated with or located in one or more cellular components; it is active in one or more biological processes, during which it performs one or more molecular functions. Each ontology thus captures a different aspect of the gene product.

A GO annotation consists of a GO term associated with a specific reference that describes the work or analysis upon which the association between a specific GO term and gene product is based. Each annotation must also include an evidence code to indicate how the annotation to a particular term is supported (http://geneontology.org/page/guide-go-evidence-codes).

GO gene sets for very broad categories, such as Biological Process, have been omitted. GO sets with fewer than 10 genes (NCBI Entrez Gene IDs) have also been omitted. We defined sets as "highly similar" if their Jaccard's coefficient was > 0.85. For each pair of highly similar sets, we kept the larger set and repeated the procedure until all such pairs were resolved.
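
The size and redundancy filters described above can be sketched as follows. This is an illustrative reading of the procedure, not the actual MSigDB code: the function names are hypothetical, and the rule for breaking ties between two equally sized sets (drop the one whose name sorts later) is an assumption, since the text does not say how ties were handled.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def filter_gene_sets(gene_sets, min_size=10, max_jaccard=0.85):
    """Drop small sets, then resolve highly similar pairs by keeping the larger set.

    gene_sets: dict of set name -> collection of gene IDs.
    """
    kept = {name: set(genes) for name, genes in gene_sets.items()
            if len(set(genes)) >= min_size}
    changed = True
    while changed:
        changed = False
        for a, b in combinations(sorted(kept), 2):
            if jaccard(kept[a], kept[b]) > max_jaccard:
                # Remove the smaller set; on a tie, remove the later name (assumption).
                loser = a if len(kept[a]) < len(kept[b]) else b
                del kept[loser]
                changed = True
                break  # restart the scan because the collection changed
    return kept
```

The restart after each removal mirrors the "repeated the procedure" wording; it is quadratic per pass, which is acceptable for a sketch but would likely need indexing for many thousands of sets.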

C7 collection details

The immunologic signatures collection (also called ImmuneSigDB) is composed of gene sets that represent cell types, states, and perturbations within the immune system. The signatures were generated by manual curation of published studies in human and mouse immunology.

We first capture relevant microarray datasets published in the immunology literature that have raw data deposited in the Gene Expression Omnibus (GEO). For each published study, the relevant comparisons are identified (e.g. WT vs. KO; pre- vs. post-treatment) and brief, biologically meaningful descriptions are created. All data are processed and normalized the same way to identify the gene sets, which correspond to the top or bottom genes (FDR < 0.25 or maximum of 200 genes) ranked by mutual information for each assigned comparison.
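
One possible reading of the selection rule above is sketched below: take the genes that pass the FDR cutoff in one direction, rank them by the strength of their score, and cap the list at 200. The input layout (gene, signed score, FDR), the use of the score's sign to separate "top" from "bottom" genes, and the function name are all assumptions for illustration; the source does not specify how the mutual-information ranking encodes direction.

```python
def select_signature(stats, direction="up", fdr_cutoff=0.25, max_genes=200):
    """Pick genes passing the FDR cutoff in one direction, capped at max_genes.

    stats: iterable of (gene, signed_score, fdr) tuples. A positive score is
    taken to mean higher expression in the first class of the comparison
    (an assumption of this sketch).
    """
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    sign = 1 if direction == "up" else -1
    passing = [(gene, score) for gene, score, fdr in stats
               if fdr < fdr_cutoff and sign * score > 0]
    # Strongest association first, regardless of sign.
    passing.sort(key=lambda item: abs(item[1]), reverse=True)
    return [gene for gene, _ in passing[:max_genes]]
```

Each comparison would then yield two gene sets, one from `direction="up"` and one from `direction="down"`.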

The immunologic signatures collection was generated as part of our collaboration with the Haining Lab at Dana-Farber Cancer Institute and the Human Immunology Project Consortium (HIPC). To cite your use of the collection, and for further information, please refer to Godec J, Tan Y, Liberzon A, Tamayo P, Bhattacharya S, Butte A, Mesirov JP, Haining WN. Compendium of Immune Signatures Identifies Conserved and Species-Specific Biology in Response to Inflammation. Immunity. 2016;44(1):194-206.