MSigDB v3.1 Release Notes

Revision as of 16:35, 10 October 2012

This page describes changes in Release 3.1 of the Molecular Signatures Database (MSigDB).

New gene set collection

C6: Oncogenic pathway activation modules (+189)

C6 is a new collection of gene sets representing expression signatures derived directly from microarray data from experiments involving gain or loss of function of several established cancer genes in well-defined, "clean" experimental systems. In this context, gain of function means increased activity of a cancer gene through over-expression or treatment with a chemical modulator. Conversely, loss of function means diminished activity of the cancer gene through RNAi knockdown, gene knockout, or enzymatic inhibition.

Updated gene set collections

C2: Curated gene sets (+1,578)

The C2 collection consists of gene sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts.

  • CGP: chemical and genetic perturbations (+1,006)
    • There are 1,035 new sets curated from papers. Of the sets in the previous (v3.0) release, 2,351 remain unchanged, 12 were renamed, and 29 were deprecated (9 by the size filters, 7 by the high-similarity filter, and 13 for other reasons during the review process).

  • CP: canonical pathways (+572)
    • Reactome

      All sets were updated from Reactome v44, which was provided to MSigDB as part of our collaboration. Reactome is a curated knowledgebase of biological pathways in humans. This update created 399 new sets, left 275 sets from MSigDB v3.0 unchanged, and deprecated 155. No sets were renamed.

    • MIPS: Munich Information Center for Protein Sequences

      The CORUM database provides a resource of manually annotated protein complexes from mammalian organisms. The MIPS gene sets correspond to human protein complexes extracted from the CORUM database (released on February 17, 2012). All 132 sets are new.

    • PID: Pathway Interaction Database

      The Pathway Interaction Database (PID) is a highly structured, curated collection of information about known biomolecular interactions and key cellular processes assembled into signaling pathways. It was a collaborative project between the NCI and Nature Publishing Group (NPG) from 2006 until September 22, 2012, and is no longer being updated. As part of the MSigDB collaboration with the PID resource, we extracted 196 gene sets from the PID data (uniprot.tab file downloaded on May 15, 2012).

Renamed and deprecated sets are listed here.
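
The C2 figures above can be cross-checked with a little arithmetic. The sketch below assumes that each net change equals new sets minus deprecated sets, that renamed sets do not affect the totals, and that all 196 PID sets count as new; the variable names are illustrative only.

```python
# Counts quoted in the C2 notes above.
cgp_new, cgp_deprecated = 1035, 29
cgp_deprecated_by_reason = {"size filters": 9, "similarity filter": 7, "other": 13}

reactome_new, reactome_deprecated = 399, 155
mips_new = 132
pid_new = 196  # assumption: every extracted PID set is new in v3.1

# Net change = new sets minus deprecated sets (renames leave the count as is).
cgp_net = cgp_new - cgp_deprecated
cp_net = (reactome_new - reactome_deprecated) + mips_new + pid_new
c2_net = cgp_net + cp_net

# The per-reason breakdown should add up to the deprecated total.
assert sum(cgp_deprecated_by_reason.values()) == cgp_deprecated

print(cgp_net, cp_net, c2_net)  # 1006 572 1578
```

Under these assumptions the results agree with the +1,006 (CGP), +572 (CP), and +1,578 (C2) figures in the headings.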

C4: Computational gene sets (-23)

  • CM: cancer modules (-23 gene sets).

    Gene sets are identical to the modules described in Segal et al., 2004. The sets represent clusters of transcriptionally co-regulated genes that both share a common functional annotation and have been found significantly deregulated in tumors. Starting with a list of 2,849 gene sets from a variety of resources such as Gene Ontology, KEGG and others, the authors extracted 456 statistically significant regulatory modules from a large compendium of published microarray data spanning 22 tumor types.

    Original members of these sets were reverted to the human Entrez Gene IDs that appeared in the original source files prior to v2.5, and the corresponding human gene symbols were then derived from those IDs. Twenty-three sets were deprecated because they contained fewer than 10 human Entrez Gene IDs. Names of all sets were changed to upper case to match the naming convention throughout MSigDB. Renamed and deprecated sets are listed here.

    For further details, refer to MSigDB Collections.


For more information

For complete descriptions of all collections or to download the updated gene sets, go to the Browse Collections page.

Other changes

Gene symbol updates

Gene sets contain a wide variety of gene identifiers, called original members here. For gene sets to be used by GSEA and other querying tools, original members have to be converted to a single common type of gene identifier. Previous releases of MSigDB used human gene symbols for this purpose. Researchers prefer working with gene symbols because they can easily recognize and remember them and put them in the context of their work. Unfortunately, a gene usually has multiple different symbols. Conversely, the same symbol can often refer to a number of different genes. Finally, gene symbols change frequently. To overcome these issues, we chose Entrez Gene IDs as robust universal identifiers (called ezid members here). Entrez Gene IDs uniquely identify human genes and never change. For convenience, we continue to display gene sets as human gene symbols. However, the symbols are now unambiguously derived from the corresponding human Entrez Gene IDs. For non-human original members, we first convert them to the organism-specific Entrez Gene IDs and then look up their orthologous human Entrez Gene IDs. For this, we rely on a collection of Bioconductor annotation packages and internal lookup tables.

We have updated gene symbols for all sets and families according to the gene_info.gz and gene_history.gz files downloaded from the Entrez Gene FTP site on November 15, 2011.
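
As a rough illustration of deriving symbols from Entrez Gene IDs, the sketch below parses small in-memory stand-ins for gene_info and gene_history. It is not the MSigDB pipeline. The function names are hypothetical, the two retired IDs and their symbols are made up, and it assumes that gene_history is used to follow a retired ID to its replacement where one exists ("-" marks no replacement). Only a few of the real files' columns are included.

```python
import csv
import io

# Tiny tab-delimited stand-ins for gene_info.gz and gene_history.gz.
GENE_INFO = "#tax_id\tGeneID\tSymbol\n9606\t7157\tTP53\n9606\t672\tBRCA1\n"
GENE_HISTORY = (
    "#tax_id\tGeneID\tDiscontinued_GeneID\tDiscontinued_Symbol\n"
    "9606\t672\t111111\tOLDSYM1\n"  # hypothetical retired ID replaced by 672
    "9606\t-\t222222\tOLDSYM2\n"    # hypothetical retired ID, no replacement
)


def read_rows(text):
    """Parse tab-delimited text with a '#'-prefixed header into dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    return [dict(zip(header, row)) for row in reader]


def build_tables(info_text, history_text, tax_id="9606"):
    """Return (ID -> symbol, retired ID -> replacement ID) for one organism."""
    symbol_by_id = {
        row["GeneID"]: row["Symbol"]
        for row in read_rows(info_text)
        if row["tax_id"] == tax_id
    }
    replacement = {
        row["Discontinued_GeneID"]: row["GeneID"]
        for row in read_rows(history_text)
        if row["tax_id"] == tax_id and row["GeneID"] != "-"
    }
    return symbol_by_id, replacement


def to_symbols(ezids, symbol_by_id, replacement):
    """Map Entrez Gene IDs to current symbols, dropping unmappable IDs."""
    symbols = set()
    for ezid in ezids:
        current = replacement.get(ezid, ezid)  # follow one replacement step
        symbol = symbol_by_id.get(current)
        if symbol is not None:
            symbols.add(symbol)
    return sorted(symbols)


symbol_by_id, replacement = build_tables(GENE_INFO, GENE_HISTORY)
print(to_symbols(["7157", "111111", "222222"], symbol_by_id, replacement))
# ['BRCA1', 'TP53']
```

The sketch follows only a single replacement step; chains of successive replacements and ortholog lookup for non-human members are left out.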

Size and similarity restrictions

After mapping to human Entrez Gene IDs, the following filters were applied to exclude sets with:

  • fewer than 5 genes (C2:CGP only)
  • fewer than 10 genes (all other collections)
  • more than 2,000 genes
  • 90% or higher similarity (overlap) to other set(s) within a collection
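
The filters above could be sketched as follows. The notes do not say how similarity is measured or which member of a similar pair is kept, so this example assumes the Jaccard index and keeps the first set encountered. The function and set names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def apply_filters(gene_sets, min_size=10, max_size=2000, max_similarity=0.9):
    """Split {name: set of Entrez Gene IDs} into (kept, dropped-with-reason).

    Pass min_size=5 for C2:CGP. Similarity is assumed to be the Jaccard
    index, and the first set seen of a similar pair is the one retained.
    """
    kept, dropped = {}, {}
    for name, genes in gene_sets.items():
        if len(genes) < min_size:
            dropped[name] = "too small"
        elif len(genes) > max_size:
            dropped[name] = "too large"
        else:
            twin = next(
                (other for other, other_genes in kept.items()
                 if jaccard(genes, other_genes) >= max_similarity),
                None,
            )
            if twin is not None:
                dropped[name] = f"similar to {twin}"
            else:
                kept[name] = genes
    return kept, dropped


example = {
    "SET_A": set(range(1, 21)),     # 20 genes
    "SET_B": set(range(1, 20)),     # 19 genes, Jaccard with SET_A = 0.95
    "SET_C": set(range(100, 104)),  # 4 genes
    "SET_D": set(range(5000)),      # 5,000 genes
}
kept, dropped = apply_filters(example)
print(sorted(kept))  # ['SET_A']
print(dropped)
# {'SET_B': 'similar to SET_A', 'SET_C': 'too small', 'SET_D': 'too large'}
```

Because size is checked after mapping, a set that loses members during conversion to human Entrez Gene IDs can fall below the minimum even if its original member list was large enough.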

Gene family changes

Fixed a discrepancy between the transcription factor and homeodomain protein families.

All homeodomain proteins are transcription factors. However, due to differences in sources and compilation procedures, some homeodomain proteins are not present among transcription factors. For this release, the transcription factors family now includes all genes annotated as homeodomain proteins.
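
In set terms, the fix amounts to making the homeodomain family a subset of the transcription factor family. A minimal sketch, using a handful of illustrative gene symbols instead of the actual MSigDB family contents:

```python
# Illustrative members only; not the real MSigDB gene families.
transcription_factors = {"TP53", "MYC", "HOXA1"}
homeodomain_proteins = {"HOXA1", "PAX6", "NKX2-5"}

# Homeodomain proteins that were absent from the transcription factor family.
missing = homeodomain_proteins - transcription_factors

# Add them so that every homeodomain protein is also a transcription factor.
transcription_factors |= homeodomain_proteins

assert homeodomain_proteins <= transcription_factors
print(sorted(missing))  # ['NKX2-5', 'PAX6']
```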



Viewing previous database versions (v3.0 and v2.5)

The MSigDB v3.0 and v2.5 files are archived and available on the Downloads page. Users can view them through the MSigDB Browser tool in the GSEA Java desktop application. Please consult the GSEA 2.0.8 Release Notes for details.