MSigDB v3.1 Release Notes
This page describes changes in Release 3.1 of the Molecular Signatures Database (MSigDB)
Gene set updates
The following describes the changes made to the gene set collections for MSigDB v3.1.
Size and similarity restrictions
- After mapping to human Entrez Gene IDs, the following filters were applied to exclude sets with:
  - fewer than 5 genes (C2:CGP only)
  - fewer than 10 genes (all other collections)
  - more than 2,000 genes
  - 90% or higher similarity (overlap) to other set(s) within a collection
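The filters above can be sketched as follows. This is only an illustration: the release notes do not say how similarity was computed or which member of a similar pair was kept, so the Jaccard index and the keep-the-first-seen rule are assumptions, and the function names are hypothetical.

```python
from typing import Dict, Set


def overlap_similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two sets (an assumed stand-in for the similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_gene_sets(
    gene_sets: Dict[str, Set[str]],
    min_size: int = 10,       # use 5 for C2:CGP
    max_size: int = 2000,
    max_similarity: float = 0.9,
) -> Dict[str, Set[str]]:
    """Drop sets that are too small, too large, or too similar to a kept set."""
    kept: Dict[str, Set[str]] = {}
    for name, genes in gene_sets.items():
        # Size filters: fewer than min_size or more than max_size genes.
        if len(genes) < min_size or len(genes) > max_size:
            continue
        # Similarity filter: 90% or higher overlap with an already kept set.
        # Keeping the first-seen set of a similar pair is an assumption.
        if any(overlap_similarity(genes, other) >= max_similarity
               for other in kept.values()):
            continue
        kept[name] = genes
    return kept
```

Here `gene_sets` would hold the sets of one collection keyed by name, with members already mapped to human Entrez Gene IDs.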
C1: Positional gene sets
No changes were made in the C1 gene sets other than updating their gene symbols.
For a description of this collection, refer to MSigDB Collections.
C2: Curated gene sets (+1,578)
The C2 collection consists of gene sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts.
Renamed and deprecated sets are listed here.
- CGP: chemical and genetic perturbations (+1,006)
- CP: canonical pathways (+572)
There are 1,035 new sets curated from papers. Of the sets in the previous v3.0 release, 2,351 remain unchanged, 12 were renamed, and 29 were deprecated (9 by the size filters, 7 by the high-similarity filter, and 13 for other reasons during the review process).
From the previous v3.0 release, all BioCarta (217), KEGG (186), and non-Reactome sets (47) remain unchanged.
- Reactome
All sets were updated from Reactome v44, provided to MSigDB as part of our collaboration. Reactome is a curated knowledgebase of biological pathways in humans. This update created 399 new sets, leaving 275 sets from MSigDB v3.0 unchanged and 155 deprecated. No sets were renamed.
- MIPS: Munich information center for protein sequences
The CORUM database provides a resource of manually annotated protein complexes from mammalian organisms. The MIPS gene sets correspond to human protein complexes extracted from the CORUM database (released on February 17, 2012). All 132 sets are new.
- PID: Pathway Interaction Database
PID is maintained by the National Cancer Institute and Nature Publishing Group. The Pathway Interaction Database assembles biomolecular interactions and cellular processes into authoritative human signaling pathways. As part of the MSigDB collaboration with the PID resource, we are happy to include gene sets extracted from PID. All 196 sets are new.
C3: Motif gene sets
No changes were made in the C3 gene sets other than updating gene symbols.
Gene sets in the C3 collection consist of genes sharing a cis-regulatory motif.
- TFT: transcription factor targets
These sets share upstream cis-regulatory motifs which can function as potential transcription factor binding sites. We used two approaches to generate these gene sets.
We extracted 460 mammalian transcriptional regulatory motifs from the TRANSFAC database v7.4. We then generated the motif gene sets consisting of the inferred target genes for each motif. Every such set consists of human genes whose promoters (defined as regions -2kb to +2kb around the transcription start site) contain at least one instance of the motif. We named these sets by the corresponding TRANSFAC matrix identifiers, e.g., V$MIF1_01. The set’s full description is the TRANSFAC entry for the matching matrix, in a format described here.
Motif gene sets of ‘conserved instances’ consist of the inferred target genes for each motif m of 174 upstream motifs highly conserved among five mammalian species. The motifs are catalogued in Xie et al., 2005 and represent potential transcription factor binding sites. Each motif gene set consists of all human genes whose promoters (defined as regions -2kb to +2kb around the transcription start site) contained at least one conserved instance of motif m. If the motif’s sequence matched a transcription factor binding site documented in the TRANSFAC database (see above), then we appended the name of the TRANSFAC binding matrix to the motif sequence in the gene set name, e.g.: MOTIFSEQ_FOO, where MOTIFSEQ is the sequence of motif m and FOO is the TRANSFAC matrix name (e.g., V$MIF1_01). The set’s full description in this case is the TRANSFAC entry for the matching matrix. If the motif’s sequence matched no transcription factor binding site from TRANSFAC v7.4, then we named the set MOTIFSEQ_UNKNOWN, where MOTIFSEQ is the sequence of motif m.
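As a rough illustration of the promoter definition and the naming rule for conserved-instance sets, a minimal sketch might look like this. The function names are hypothetical, and the motif sequence in the usage note is a placeholder rather than an actual MSigDB set.

```python
from typing import Optional

PROMOTER_FLANK = 2000  # promoter defined as -2kb to +2kb around the TSS


def in_promoter(tss: int, position: int, flank: int = PROMOTER_FLANK) -> bool:
    """True if a motif instance lies within the promoter window of a TSS."""
    # The window is symmetric, so strand does not affect the test.
    return abs(position - tss) <= flank


def motif_set_name(motif_seq: str, transfac_matrix: Optional[str] = None) -> str:
    """Build MOTIFSEQ_FOO when a TRANSFAC matrix matches, else MOTIFSEQ_UNKNOWN."""
    suffix = transfac_matrix if transfac_matrix else "UNKNOWN"
    return f"{motif_seq}_{suffix}"
```

For example, `motif_set_name("TGACGTCA", "V$MIF1_01")` gives `TGACGTCA_V$MIF1_01`, while `motif_set_name("TGACGTCA")` gives `TGACGTCA_UNKNOWN`.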
- MIR: microRNA targets
These gene sets consist of the inferred target genes for each motif m of 221 3'-UTR motifs highly conserved among five mammalian species. The motifs are catalogued in Xie et al., 2005 and represent potential microRNA binding sites. Each motif gene set consists of all genes whose 3’-UTR contained at least one conserved instance of motif m.
C4: Computational gene sets (-23)
- CM: cancer modules (-23 gene sets).
Gene sets are identical to the modules described in Segal et al., 2004. The sets represent clusters of transcriptionally co-regulated genes that both share a common functional annotation and have been found significantly deregulated in tumors. Starting with a list of 2,849 gene sets from a variety of resources such as Gene Ontology, KEGG and others, the authors extracted 456 statistically significant regulatory modules from a large compendium of published microarray data spanning 22 tumor types.
Original members of these sets were reverted to human Entrez Gene IDs as they appeared in the original source files prior to v2.5, and the corresponding human gene symbols were derived thereafter. Twenty-three sets were deprecated because they contained fewer than 10 human Entrez Gene IDs. Names of all sets were changed to upper case to match the naming convention throughout MSigDB.
For further details, refer to MSigDB Collections.
- CGN: cancer gene neighborhoods
No changes were made in the C4:CGN gene sets other than updating their gene symbols.
Starting with a curated list of 380 cancer-associated genes (Brentani et al., 2003), Subramanian, Tamayo et al., 2005 mined four expression compendia for correlated gene sets. Gene neighborhoods with fewer than 25 genes at a Pearson correlation threshold of 0.8 were omitted, yielding 427 sets.
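The neighborhood construction can be sketched in a few lines. This is a simplified illustration, not the published procedure: the data layout (a dictionary of expression profiles keyed by gene), the exclusion of the seed gene from its own neighborhood, and the function names are all assumptions.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def build_neighborhoods(
    seeds: Sequence[str],
    expression: Dict[str, Sequence[float]],
    threshold: float = 0.8,
    min_genes: int = 25,
) -> Dict[str, List[str]]:
    """Collect genes correlated with each seed; omit neighborhoods that are too small."""
    neighborhoods: Dict[str, List[str]] = {}
    for seed in seeds:
        profile = expression[seed]
        # Excluding the seed itself is an assumption of this sketch.
        neighbors = [gene for gene, other in expression.items()
                     if gene != seed and pearson(profile, other) >= threshold]
        if len(neighbors) >= min_genes:
            neighborhoods[seed] = neighbors
    return neighborhoods
```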
- Gene set names indicate the corresponding expression compendia and the seed cancer-associated genes:
- GNF2: Novartis normal human tissue gene expression compendium (Su et al., 2004)
- CAR: Novartis carcinoma gene expression compendium (Su et al., 2001)
- GCM: Global cancer map compendium (Ramaswamy et al., 2001)
- MORF: A large internal compendium of gene expression data sets, including many in-house Affymetrix U95 cancer samples (1,693 in all) from a variety of cancer projects representing many different tissue types, mainly primary tumors, such as prostate, breast, lung, lymphoma, leukemia, etc. (Subramanian, Tamayo et al., 2005)
C5: Gene Ontology gene sets
No changes were made in the C5 gene sets other than updating gene symbols.
For a description of this collection, see the MSigDB Collections page.
C6: Oncogenic pathway activation modules (+189)
This collection is new. Gene sets are expression signatures derived directly from microarray data from experiments involving gain or loss of function of several established cancer genes in well defined, "clean" experimental systems. In this context, gain of function stands for increased activity of a cancer gene by means of over-expression or treatment with a chemical modulator. Conversely, loss of function stands for diminished activity of the cancer gene by means of RNAi knockdown, gene knockout, or enzymatic inhibition.
For more information
For complete descriptions of all collections or to download the updated gene sets, go to the Browse Collections page.
Other changes
Gene symbol updates
Gene sets consist of a large variety of gene identifiers, called original members here. For gene sets to be used by GSEA and other querying tools, original members have to be converted to a common, universal kind of gene identifier. Previous releases of MSigDB used human gene symbols for this purpose. Researchers prefer working with gene symbols because they can easily recognize them, remember them, and put them in the context of their work. Unfortunately, a gene usually has multiple different symbols. Conversely, the same symbol can often refer to a number of different genes. Finally, gene symbols change frequently. To overcome these issues, we chose Entrez Gene IDs as robust universal identifiers (called ezid members here). Entrez Gene IDs uniquely identify human genes and never change. For convenience, we continue displaying gene sets as made from human gene symbols. However, the symbols are now unambiguously derived from the corresponding human Entrez Gene IDs. For non-human original members, we first convert them to the organism-specific Entrez Gene IDs and then seek their orthologous counterparts as human Entrez Gene IDs. For this, we rely on a collection of Bioconductor Annotation packages and internal lookup tables.
We have updated gene symbols for all sets and families according to the gene_info.gz and gene_history.gz files downloaded from the Entrez Gene FTP site on November 15, 2011.
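The conversion chain described above (original member, then organism-specific Entrez Gene ID, then human Entrez Gene ID, then human gene symbol) can be pictured with a small sketch. The lookup tables here are plain dictionaries standing in for the Bioconductor annotation packages and internal tables; their shape and the function names are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Optional, Tuple

HUMAN = "Homo sapiens"


def to_human_entrez(
    member: str,
    organism: str,
    to_entrez: Dict[Tuple[str, str], str],  # (organism, original id) -> Entrez Gene ID
    orthologs: Dict[str, str],              # non-human Entrez ID -> human Entrez ID
) -> Optional[str]:
    """Map one original member to a human Entrez Gene ID, or None if unmapped."""
    ezid = to_entrez.get((organism, member))
    if ezid is None:
        return None
    # Human members need no ortholog lookup.
    return ezid if organism == HUMAN else orthologs.get(ezid)


def to_human_symbols(
    members: Iterable[str],
    organism: str,
    to_entrez: Dict[Tuple[str, str], str],
    orthologs: Dict[str, str],
    entrez_to_symbol: Dict[str, str],
) -> List[str]:
    """Derive human gene symbols from Entrez Gene IDs, dropping unmapped members."""
    symbols: List[str] = []
    for member in members:
        ezid = to_human_entrez(member, organism, to_entrez, orthologs)
        symbol = entrez_to_symbol.get(ezid) if ezid is not None else None
        if symbol is not None and symbol not in symbols:
            symbols.append(symbol)
    return symbols
```

Because every symbol is looked up from a single human Entrez Gene ID, two original members that resolve to the same gene yield one symbol in the resulting set.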
Gene family changes
Fixed a discrepancy between a family of transcription factors and homeodomain proteins.
All homeodomain proteins are transcription factors. However, due to differences in sources and compilation procedures, some homeodomain proteins are not present among transcription factors. For this release, the transcription factors family now includes all genes annotated as homeodomain proteins.
Organism annotations
We continue using scientific names to indicate source organism throughout MSigDB. Organism information corresponds to species annotation associated with original members.
Continued support for various GMT files
- human gene symbols: contain the word symbols in their names
For standard GSEA analysis, no change is expected: just continue using these files as before. Starting with v3.1, all human gene symbols are derived from human Entrez Gene IDs. These files should serve for all standard analytical purposes, such as the default source of gene sets for GSEA.
- original gene identifiers: contain the word orig in their names
These files contain original members - identifiers reported exactly as they appear in the sources of gene sets. Because original identifiers are from a variety of platforms, we do not recommend using them for routine GSEA analysis. Instead, these files should serve as a reference and for uses other than standard GSEA. In the previous release, these gene sets consisted of Entrez Gene IDs that were not necessarily human.
- human Entrez Gene IDs: contain the word entrez in their names
While Entrez Gene IDs are more robust and reliable identifiers than gene symbols, they are much less convenient for standard purposes.
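Whichever identifier flavor is chosen, the files can be read the same way. The sketch below assumes the usual GMT layout of one gene set per line with tab-separated fields (set name, description, then members); the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def parse_gmt(lines: Iterable[str]) -> Dict[str, Tuple[str, List[str]]]:
    """Parse GMT lines into {set name: (description, members)}."""
    gene_sets: Dict[str, Tuple[str, List[str]]] = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # A valid line needs a name, a description, and at least one member.
        if len(fields) < 3:
            continue
        name, description, *members = fields
        # Skip empty fields left by trailing tabs.
        gene_sets[name] = (description, [m for m in members if m])
    return gene_sets
```

An open file handle can be passed directly as `lines`, since iterating over a text file yields one line at a time.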
Viewing previous database versions (v3.0 and v2.5)
The MSigDB v3.0 and v2.5 files are archived and available on the Downloads page. Users can view them through the MSigDB Browser tool in the GSEA Java desktop application. Please consult the GSEA 2.0.8 Release Notes for details.