MSigDB collections

From GeneSetEnrichmentAnalysisWiki

Revision as of 18:46, 2 November 2016 by Liberzon (Talk | contribs)

This page provides detailed descriptions of all collections of gene sets in MSigDB.

To learn about changes and other information specific to a particular release of MSigDB, please refer to the corresponding Release_Notes.


H: hallmark gene sets

First introduced in v5.0 MSigDB.

We envision this collection as the starting point for exploring the MSigDB resource and GSEA.

Hallmark gene sets represent specific, well-defined biological states or processes and display coherent expression. The hallmark gene sets were generated by a computational methodology based on identifying gene set overlaps and extracting coherent representatives of them. Details of the procedure will become available after the manuscript describing it is accepted for publication. The hallmark gene sets reduce noise and redundancy and provide a better biological space for GSEA and other gene set-based analyses of genomic data.

This collection is an initial release of 50 hallmarks which condense information from over 4,000 original overlapping gene sets from v4.0 MSigDB collections C1 through C6. We refer to the original gene sets as “founder” sets.

Hallmark gene set pages provide links to the corresponding founder sets for more in-depth exploration. In addition, hallmark gene set pages include links to the microarray data that were used to refine and validate the hallmark signatures.

To cite your use of the collection, and for further information, please refer to

Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P.
The Molecular Signatures Database (MSigDB) hallmark gene set collection.
Cell Syst. 2015 Dec 23;1(6):417-425. PMID: 26771021

C1: positional gene sets

First introduced in v1.0 MSigDB (the initial release).

Genes from the same genomic location (chromosome or cytogenetic band) are grouped in a gene set. Cytogenetic annotations are from three sources:

  1. Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC)
  2. UniGene
  3. Affymetrix microarray annotations

We merged the relevant annotations from these resources and derived a single cytogenetic band location for every gene symbol. These were then grouped into sets. Decimals in cytogenetic bands were ignored. For example, 5q31.1 was considered 5q31. Therefore, genes annotated as 5q31.2 and those annotated as 5q31.3 were both placed in the same set, 5q31.

When there were conflicts, the UniGene entry was used.
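
The band-collapsing step can be illustrated with a minimal sketch. The function names and the example gene symbols below are hypothetical, and the sketch assumes each gene already has a single resolved band; it does not reproduce the merging of the three annotation sources.

```python
from collections import defaultdict


def band_without_decimal(band):
    """Drop the sub-band decimal, e.g. '5q31.2' -> '5q31'."""
    return band.split(".", 1)[0]


def group_by_band(gene_to_band):
    """Group gene symbols into one set per cytogenetic band."""
    sets = defaultdict(set)
    for gene, band in gene_to_band.items():
        sets[band_without_decimal(band)].add(gene)
    return dict(sets)


# Made-up annotations, for illustration only.
example = {"GENE_A": "5q31.2", "GENE_B": "5q31.3", "GENE_C": "7p22"}
positional_sets = group_by_band(example)
```

Here GENE_A and GENE_B end up together in the 5q31 set, as described above for genes annotated as 5q31.2 and 5q31.3.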

These sets are helpful in identifying effects related to chromosomal deletions or amplifications, dosage compensation, epigenetic silencing, and other regional effects.

C2: curated gene sets

First introduced in v1.0 MSigDB (the initial release).

Gene sets collected from various sources such as online pathway databases, scientific publications and personal contributions from domain experts.

CGP: chemical and genetic perturbations

  • Sets curated from biomedical literature

    Over the past few years, microarray studies have identified signatures of several important biological and clinical states (e.g., cancer metastasis, stem cell characteristics, drug resistance). These gene sets are valuable biological results. Unfortunately, because gene sets are typically published as tables in a paper, the important biological findings they represent are not easily accessible to computational tools. Our first goal was to convert published gene sets into an electronic form. To this end, we compiled a list of microarray articles with published gene expression signatures. From each article, we extracted one or more gene sets from tables in the main text or supplementary information. Notably, our focus was on capturing the identity (e.g., gene symbol, GenBank accession) of all members in a gene set rather than on relationships between individual genes.

    A number of these gene sets come in pairs: an xxx_UP (xxx_DN) gene set representing genes induced (repressed) by the perturbation.

    Names of CGP sets start with the last name of the first author of the source publication. The majority of CGP sets were curated from publications and include links to the PubMed citation, the exact source of the set (e.g., Table 1) and links to the corresponding raw data in GEO or ArrayExpress repositories. When the set involves a genetic perturbation, the brief description includes a link to the gene's entry in the NCBI Entrez Gene database. When the set involves a chemical perturbation, the brief description includes a link to the chemical's entry in the NCBI PubChem Compound database.

    • curated by the MSigDB curation team
    • contributed by the L2L database: these sets came from the L2L database of published microarray gene expression data (described in Newman and Weiner) and were kindly shared with MSigDB. These sets list John Newman as the contributor.

  • Sets contributed by Dr. Chi Dang from the MYC Target Gene Database.
  • Individual gene set compilations contributed by MSigDB collaborators. These sets usually are not based on any specific publication.

CP: canonical pathways

Gene sets from the pathway databases. Usually, these gene sets are canonical representations of a biological process compiled by domain experts. These gene sets were either extracted from the online sources by the MSigDB curation team or contributed by the teams of the pathway databases in collaboration with the MSigDB curation team.

C3: motif gene sets

First introduced in v1.0 MSigDB (the initial release).

Gene sets group genes by cis-regulatory motifs. The motifs are catalogued in Xie et al. and represent known or putative conserved regulatory elements in promoters and 3’-UTR regions. These sets make it possible to link changes in a genomic experiment to conserved, putative cis-regulatory elements.

Transcription factor targets (TFT)

These sets share upstream cis-regulatory motifs which can function as potential transcription factor binding sites. We used two approaches to generate these gene sets.

Motif gene sets of ‘conserved instances’ consist of the inferred target genes for each motif m of 174 upstream motifs highly conserved among four mammalian species (H. sapiens, M. musculus, R. norvegicus and C. lupus familiaris). The motifs are catalogued in Xie, et al. (2005, Nature 434, 338–345) and represent potential transcription factor binding sites. Each motif gene set consists of all human genes whose promoters (defined as regions -2kb to +2kb around transcription start site) contained at least one conserved instance of motif m. If the motif’s sequence matched a transcription factor binding site documented in the v7.4 TRANSFAC database, then we appended the name of the TRANSFAC binding matrix to the motif sequence in the gene set name, e.g.: MOTIFSEQ_FOO, where MOTIFSEQ is the sequence of motif m and FOO is the TRANSFAC matrix name (e.g., V$MIF1_01). The set’s full description in this case is the TRANSFAC entry for the matching matrix. If the motif’s sequence matched no transcription factor binding site from the v7.4 TRANSFAC database, then we named the set MOTIFSEQ_UNKNOWN, where MOTIFSEQ is the consensus sequence of motif m.
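
The naming convention for the conserved-instance sets can be sketched as a small helper. The function name and its parameters are hypothetical, and "MOTIFSEQ" is kept as the same placeholder used in the text rather than a real motif sequence.

```python
def motif_set_name(motif_seq, transfac_matrix=None):
    """Build a set name of the form MOTIFSEQ_FOO or MOTIFSEQ_UNKNOWN.

    motif_seq: consensus sequence of the motif (placeholder in the example).
    transfac_matrix: name of the matching TRANSFAC matrix, or None when
    the motif matched no documented binding site.
    """
    suffix = transfac_matrix if transfac_matrix else "UNKNOWN"
    return f"{motif_seq}_{suffix}"


named = motif_set_name("MOTIFSEQ", "V$MIF1_01")
unnamed = motif_set_name("MOTIFSEQ")
```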

We also extracted 460 mammalian transcriptional regulatory motifs from v7.4 TRANSFAC database. We then generated the motif gene sets consisting of the inferred target genes for each motif. Every such set consists of human genes whose promoters (defined as regions -2kb to +2kb around transcription start site) contain at least one instance of the motif. We named these sets by the corresponding TRANSFAC matrix identifiers, e.g., V$MIF1_01. The set’s full description is the TRANSFAC entry for the matching matrix, in a format described here.

Brief descriptions of sets in this collection follow this syntax: "Genes with promoter regions [-2kb,2kb] around transcription start site containing motif MOTIFSEQ". If the motif's sequence matched a TRANSFAC binding matrix and the name of the matrix corresponded to a human gene symbol, then the text continues as follows: "... which matches annotation for GENE_SYMBOL: FULL GENE NAME." Otherwise, the text continues as follows: "Motif does not match any known transcription factor".

microRNA Targets (MIR)

These gene sets consist of the inferred target genes for each motif m of 221 3'-UTR motifs highly conserved among four mammalian species (H. sapiens, M. musculus, R. norvegicus and C. lupus familiaris). The motifs are catalogued in Xie, et al. (2005, Nature 434, 338–345) and represent potential microRNA binding sites. Each motif gene set consists of all genes whose 3’-UTR contained at least one conserved instance of motif m.

C4: computational gene sets

First introduced in v1.0 MSigDB (the initial release).

Gene sets defined by mining large collections of cancer-oriented genes.

CGN: Cancer Gene Neighborhoods

These sets are defined by expression neighborhoods centered on cancer-related genes. This collection was originally described in Subramanian, Tamayo et al. 2005.

Starting with a list of 380 cancer-associated genes curated from internal resources and a published cancer gene database [REF], Subramanian, Tamayo et al. 2005 mined four expression compendia datasets for correlated gene sets. Using the profile of a given gene as a template, they ordered every other gene in the data set by its Pearson correlation coefficient. A cutoff of R ≥ 0.85 was then applied to extract correlated genes. The calculation of neighborhoods was done independently in each data set. In this way, a given oncogene may have up to four "types" of neighborhoods according to the correlation present in each compendium. Neighborhoods with < 25 genes at this threshold were omitted, yielding the final 427 sets.
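
A minimal sketch of the neighborhood calculation for one compendium follows. It is not the original implementation: the function names and the toy profiles are hypothetical, the cutoff is read as a lower bound on the correlation (genes with R of at least 0.85 are kept), and the seed gene itself is assumed to be excluded from its neighborhood.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat profile counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def neighborhood(seed, profiles, r_cutoff=0.85, min_size=25):
    """Genes correlated with the seed gene, ordered by decreasing R.

    Returns None when fewer than min_size genes pass the cutoff,
    mirroring the omission of small neighborhoods.
    """
    template = profiles[seed]
    scored = [(pearson(template, p), g) for g, p in profiles.items() if g != seed]
    scored.sort(reverse=True)
    members = [g for r, g in scored if r >= r_cutoff]
    return members if len(members) >= min_size else None


# Toy expression profiles, for illustration only.
toy = {
    "SEED": [1, 2, 3, 4],
    "G1": [2, 4, 6, 8],
    "G2": [1, 2, 3, 5],
    "G3": [4, 3, 2, 1],
}
small_example = neighborhood("SEED", toy, min_size=2)
```

Running the same function once per compendium would give up to four neighborhoods per seed gene.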

The names of these gene sets start with a code indicating the corresponding expression compendium followed by the symbol of the cancer associated gene.

The compendia and their codes are listed below:

CM: Cancer Modules

First introduced in v2.5 MSigDB.

Gene sets are identical to the modules described in Segal et al. Starting with 2,849 gene sets from a variety of resources such as KEGG, Gene Ontology, and others, the authors mined a large compendium of cancer-related microarray data and identified 456 transcriptionally co-regulated modules. Two sets with fewer than 10 NCBI human Entrez Gene IDs have been deprecated since v3.0 MSigDB.

The names of these sets start with MODULE_ followed by the module number according to the contributor's notes. Gene set pages contain external links to further details about these sets.

C5: GO gene sets

First introduced in v2.5 MSigDB. The collection underwent a complete overhaul in v5.2 MSigDB.

Gene sets in this collection are derived from the controlled vocabulary of the Gene Ontology (GO) project.

The Gene Ontology project is a collaborative effort to develop and use ontologies to support biologically meaningful annotation of genes and their products. Each entry in GO (a GO term) has a unique numerical identifier of the form GO:nnnnnnn, and a term name. Each term is also assigned to one of the three ontologies: molecular function, cellular component or biological process. The ontologies are structured as directed acyclic graphs that represent a network in which a child (i.e., more specialized) term can have one or more parent (i.e., less specialized) terms. Every GO term must obey the true path rule: if the child term describes a gene product, then all its parent terms must also apply to that gene product. Annotation is the process of assigning GO terms to gene products. A GO annotation consists of a GO term associated with a specific reference that describes the work or analysis upon which the association between a specific GO term and gene product is based. Each annotation must also include an evidence code to indicate how the annotation to a particular term is supported.

There are three sub-collections according to three key biological domains in GO.

CC: GO Cellular component.

Gene sets derived from the Cellular Component (CC) ontology. This ontology describes the location of a gene product, within cellular structures and within macromolecular complexes.

MF: GO Molecular function.

Gene sets are derived from the Molecular Function (MF) ontology. This ontology describes what a gene product does at the biochemical level. It describes only what is done without specifying where or when the event actually occurs or its broader context.

BP: GO Biological process.

Gene sets derived from the Biological Process (BP) ontology. This ontology describes a broad biological objective to which the gene product contributes. A process is accomplished via one or more ordered assemblies of functions. It often involves transformation in the sense that something goes into a process and something different comes out of it.

Outline of the procedure

All sets are based on associations of GO terms to human genes. Genes annotated with the same GO term make the corresponding GO term gene set.

The input data files are:

  • gene2go

    The file used for v2.5 was downloaded on January 22, 2008.

    The file used for v5.2 was downloaded on May 3, 2016.

    This file reports GO terms that have been associated with genes in NCBI Entrez Gene. It is generated by processing the gene_association file on the GO FTP site and comparing the DB_Object_ID to annotation in NCBI Entrez Gene, as also reported in gene_info.gz. The file is available here. It is a tab delimited plain text file with one tax_id / gene_id / evidence_code per line.

  • gene ontology

    The file used for v2.5, gene_ontology_edit_obo, was downloaded on January 25, 2008.

    The file used for v5.2, go-basic.obo, was downloaded on May 3, 2016.

    This file contains the entire GO ontology in OBO v.1.2 format [1]. The file is produced by the Gene Ontology Consortium and is updated every 30 minutes [2]. Monthly releases are also available [3]. OBO is the plain text file format used by OBO-Edit, the open source, platform-independent application for viewing and editing ontologies.
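
As a rough illustration of how gene2go associations can be turned into per-term gene sets, the sketch below reads tab-delimited rows and keeps human entries. It assumes the first four columns are tax_id, GeneID, GO_ID and Evidence, in that order, and that human rows carry NCBI taxonomy ID 9606; the function name, the optional evidence-code filter, and the sample rows (including the GO identifier) are made up for the example.

```python
import io
from collections import defaultdict

HUMAN_TAX_ID = "9606"  # NCBI taxonomy ID for Homo sapiens


def go_sets_from_gene2go(handle, evidence_codes=None):
    """Map each GO term to the set of human gene IDs annotated with it.

    handle: any iterable of tab-delimited lines (assumed column order:
    tax_id, GeneID, GO_ID, Evidence, ...).
    evidence_codes: optional set of codes to keep; None keeps everything.
    """
    sets = defaultdict(set)
    for line in handle:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        tax_id, gene_id, go_id, evidence = fields[:4]
        if tax_id != HUMAN_TAX_ID:
            continue
        if evidence_codes is not None and evidence not in evidence_codes:
            continue
        sets[go_id].add(gene_id)
    return dict(sets)


# Made-up rows, for illustration only.
sample = (
    "#tax_id\tGeneID\tGO_ID\tEvidence\n"
    "9606\t1001\tGO:9999991\tIDA\n"
    "9606\t1002\tGO:9999991\tIEA\n"
    "10090\t2001\tGO:9999991\tIDA\n"
)
all_codes = go_sets_from_gene2go(io.StringIO(sample))
ida_only = go_sets_from_gene2go(io.StringIO(sample), evidence_codes={"IDA"})
```

Passing an open file handle instead of the in-memory sample would process a downloaded copy of the file in the same way.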

v2.5 MSigDB (2008)

First, for each GO term we retrieved the corresponding human genes from the gene2go file. For v2.5, we retrieved only associations with manually assigned evidence codes (IDA, IPI, IMP, IGI, IEP, ISS and TAS) because they usually reflect stronger evidence for the association of a given gene product with the corresponding GO term. Next, we applied the true path rule. Gene products are associated with the most specific GO terms possible, and all parent terms up to the root automatically apply to the gene product. Thus, the parent GO term gene sets should include all genes associated with the children GO terms. Then we removed sets with fewer than 10 genes. Next, we removed very large sets for extremely broad GO terms by manually inspecting large sets corresponding to GO terms at depths 2 and 3. Finally, we resolved redundancies by applying the following rules:

  • a set equals parental GO term set: keep the parent, discard the child GO term
  • a set equals many parental GO term sets: keep the child, discard the parent GO terms
  • equal sets are siblings: discard both
  • equal sets do not have a child-parent relationship: discard both
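
The propagation of genes to parent terms and the removal of small sets can be sketched as follows. This is an illustrative reading of the procedure, not the original code: the function names and the toy ontology are hypothetical, and the manual steps (inspection of very broad terms, redundancy rules) are not covered.

```python
from collections import defaultdict


def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate(direct, parents):
    """Add each term's genes to the sets of all its ancestor terms."""
    full = defaultdict(set)
    for term, genes in direct.items():
        full[term] |= genes
        for anc in ancestors(term, parents):
            full[anc] |= genes
    return dict(full)


def drop_small(sets, min_size=10):
    """Remove sets with fewer than min_size genes."""
    return {term: genes for term, genes in sets.items() if len(genes) >= min_size}


# Toy ontology and annotations, for illustration only.
toy_parents = {"child": ["parent"], "parent": ["root"]}
toy_direct = {"child": {"g1", "g2"}, "parent": {"g3"}}
toy_full = propagate(toy_direct, toy_parents)
toy_kept = drop_small(toy_full, min_size=3)
```

In the toy example the parent and root sets each receive all three genes, and the two-gene child set is dropped by the size filter.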

v5.2 MSigDB (2016)

First, for each GO term we retrieved the corresponding human genes from the gene2go file. This time, no associations were filtered by evidence codes. Instead, we specifically excluded associations with the root GO terms (GO:0008150, GO:0005575, and GO:0003674) and the 'unknown' GO terms (GO:0000004, GO:0005554, and GO:0008372). Next, we applied the true path rule. Gene products are associated with the most specific GO terms possible, and all parent terms up to the root automatically apply to the gene product. Thus, the parent GO term gene sets should include all genes associated with the children GO terms. Then we removed sets with fewer than 10 or more than 2,000 Entrez Gene IDs. Finally, we resolved redundancies as follows. We computed the Jaccard coefficient for each pair of sets, and marked a pair as highly similar if its Jaccard coefficient was greater than 0.85. We then grouped pairs of highly similar sets into "chunks" according to their GO terms and applied two rounds of filtering for every "chunk". The first round was computational: we kept the largest set in the "chunk" and discarded the smaller sets. This left "chunks" of highly similar pairs of sets of identical sizes, which we further pruned manually by preferably keeping the more general set (i.e., the set with the more general GO term in the ontology tree).
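
The computational part of the redundancy filter can be sketched as below. The Jaccard coefficient of two sets A and B is |A ∩ B| / |A ∪ B|. How pairs were grouped into "chunks" is not spelled out here, so the sketch assumes a chunk is a connected component of the "highly similar" relation; all function names and the toy sets are hypothetical, and the manual second round is not modelled.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(sets, threshold=0.85):
    """Pairs of set names whose Jaccard coefficient exceeds the threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(sets), 2)
        if jaccard(sets[x], sets[y]) > threshold
    ]


def chunks(pairs):
    """Connected components of the similarity graph (assumed meaning of 'chunk')."""
    neighbours = {}
    for x, y in pairs:
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)
    seen, result = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        result.append(component)
    return result


def keep_largest(chunk, sets):
    """First filtering round: keep only the largest set(s) of a chunk."""
    top = max(len(sets[name]) for name in chunk)
    return {name for name in chunk if len(sets[name]) == top}


# Toy gene sets, for illustration only.
toy_sets = {
    "A": set(range(10)),
    "B": set(range(9)),
    "D": {20, 21},
    "E": set(range(10)),
}
toy_pairs = similar_pairs(toy_sets)
toy_chunks = chunks(toy_pairs)
survivors = [keep_largest(c, toy_sets) for c in toy_chunks]
```

In the toy example, A and E tie for the largest size, which corresponds to the situation left for manual pruning in the second round.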

C6: oncogenic signatures

First introduced in v3.1 MSigDB.

Gene sets represent signatures of cellular pathways which are often dysregulated in cancer. The majority of signatures were generated directly from microarray data from NCBI GEO or from in-house unpublished expression profiling experiments which involved perturbation of known cancer genes. In addition, a small number of oncogenic signatures were curated from scientific publications.

C7: immunologic signatures

First introduced in v4.0 MSigDB, this collection is also called ImmuneSigDB, and is a compendium of gene sets that represent cell types, states and perturbations involving the immune system.

We captured Affymetrix microarray datasets (human or mouse samples) published in the immunology literature whose data had been deposited in the Gene Expression Omnibus (GEO). We reviewed each microarray dataset in the context of its study and defined meaningful pairwise comparisons (e.g., WT vs. KO; pre- vs. post-treatment) that would lead to biologically useful gene sets. Only meaningful comparisons, rather than all possible comparisons, were made. All data were processed uniformly. For each two-class comparison, the genes were ranked according to an information-based similarity metric (RNMI) from top up-regulated to bottom down-regulated genes in the two groups. Gene sets comprised genes differentially expressed with an FDR < 0.02, and the maximum number of genes was set at 200 (i.e., all gene sets had at most 200 differentially expressed genes). In this way we generated two gene sets from each assigned comparison of two groups, “Group_A_vs_Group_B_UP” and “Group_A_vs_Group_B_DN”, containing the top up-regulated and bottom down-regulated genes, respectively, that is, the genes most different in the samples in group A compared to the samples in group B.
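
The selection of the UP and DN sets from one ranked comparison can be sketched as follows. The sketch does not compute RNMI or FDR values; it assumes each gene already comes with a signed score (positive meaning higher in group A) and an FDR, which is a simplification introduced for the example. The function name and the toy data are hypothetical.

```python
def up_dn_sets(ranked, group_a, group_b, fdr_cutoff=0.02, max_genes=200):
    """Build the UP and DN gene sets for one two-class comparison.

    ranked: list of (gene, score, fdr) tuples ordered from the top
    up-regulated gene to the bottom down-regulated gene.
    Returns a dict mapping each set name to its list of genes.
    """
    up = [g for g, score, fdr in ranked if score > 0 and fdr < fdr_cutoff]
    dn = [g for g, score, fdr in reversed(ranked) if score < 0 and fdr < fdr_cutoff]
    base = f"{group_a}_vs_{group_b}"
    return {f"{base}_UP": up[:max_genes], f"{base}_DN": dn[:max_genes]}


# Toy ranking, for illustration only.
toy_ranked = [
    ("g1", 2.0, 0.001),
    ("g2", 1.0, 0.01),
    ("g3", 0.5, 0.5),
    ("g4", -0.5, 0.3),
    ("g5", -1.5, 0.015),
    ("g6", -3.0, 0.0001),
]
toy_result = up_dn_sets(toy_ranked, "Group_A", "Group_B")
capped = up_dn_sets(toy_ranked, "Group_A", "Group_B", max_genes=1)
```

The cap is applied after the FDR filter, so a comparison with few significant genes yields sets smaller than 200.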

This collection was generated as part of our collaboration with the Haining Lab at Dana-Farber Cancer Institute and the Human Immunology Project Consortium (HIPC).

To cite your use of the collection, and for further information, please refer to
Godec J, Tan Y, Liberzon A, Tamayo P, Bhattacharya S, Butte A, Mesirov JP, Haining WN.
Compendium of Immune Signatures Identifies Conserved and Species-Specific Biology in Response to Inflammation.
Immunity. 2016 Jan 19; 44(1): 194-206. PMID: 26795250
