Difference between revisions of "MSigDB v3.1 Release Notes"

Latest revision as of 03:10, 25 September 2016

GSEA Home | Downloads | Molecular Signatures Database | Documentation | Contact

Improved mapping to common gene identifiers

MSigDB now uses human Entrez Gene IDs as the common gene identifiers for all gene sets. Gene sets come from a number of different sources and are originally specified using a variety of gene identifiers. The MSigDB gene sets are converted to a common set of gene identifiers so they can be used in GSEA and other tools. Previous releases of MSigDB used human gene symbols for this purpose. Researchers prefer working with gene symbols because they are easy to recognize, remember, and place in the context of their work. Unfortunately, a gene usually has multiple symbols. Conversely, the same symbol may refer to a number of different genes. Finally, gene symbols change frequently. To overcome these issues, we now use Entrez Gene IDs as the universal gene identifier for all MSigDB gene sets. Entrez Gene IDs identify genes uniquely and never change. For convenience, we continue to display gene sets as human gene symbols by default. However, the symbols are now unambiguously derived from the corresponding human Entrez Gene IDs. For original members that are not human, we first convert them to organism-specific Entrez Gene IDs and then look up their orthologous human Entrez Gene IDs. Finally, we derive human gene symbols from the corresponding common gene identifiers for all standard uses. Note that all gene sets are also available with the original identifiers specified by the source and as Entrez Gene IDs.

Human Entrez Gene IDs and the corresponding symbols for the MSigDB v3.1 gene sets are based on gene_info.gz and gene_history.gz, downloaded from the Entrez Gene FTP site on November 15, 2011. Mouse -> Human and Rat -> Human 1:1 orthologous relationships are from Mouse Genome Informatics (MGI).
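
The conversion described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the MSigDB pipeline itself: the lookup tables are tiny hypothetical stand-ins for what one might load from gene_info.gz, gene_history.gz and an MGI orthology table, the discontinued ID is made up, and the function name to_human_symbols is invented for this example.

```python
# Hypothetical toy tables. Real ones would be parsed from gene_info.gz,
# gene_history.gz and an MGI orthology export; values here are illustrative.
HUMAN_SYMBOL = {7157: "TP53", 672: "BRCA1"}   # human Entrez Gene ID -> symbol
REPLACED_BY = {900000001: 672}                # made-up discontinued ID -> current ID
ORTHOLOG_TO_HUMAN = {
    "mouse": {22059: 7157, 12189: 672},       # mouse Entrez Gene ID -> human 1:1 ortholog
}


def to_human_symbols(gene_ids, organism="human"):
    """Map Entrez Gene IDs to human symbols; return (symbols, unmapped IDs)."""
    symbols = set()
    unmapped = []
    for original in gene_ids:
        gid = original
        if organism != "human":
            # Step 1: organism-specific Entrez Gene ID -> human ortholog.
            gid = ORTHOLOG_TO_HUMAN[organism].get(gid)
        if gid is not None:
            # Step 2: follow a discontinued ID to its replacement, if any.
            gid = REPLACED_BY.get(gid, gid)
        # Step 3: derive the display symbol from the human Entrez Gene ID.
        symbol = HUMAN_SYMBOL.get(gid)
        if symbol is None:
            unmapped.append(original)
        else:
            symbols.add(symbol)
    return sorted(symbols), unmapped


print(to_human_symbols([22059, 12189, 1], organism="mouse"))
# prints (['BRCA1', 'TP53'], [1])
```

Keeping the Entrez Gene ID as the key at every step and deriving the symbol only at the end is what makes the displayed symbols unambiguous in this sketch.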

MSigDB gene sets are subject to size and similarity restrictions. After mapping to human Entrez Gene IDs, filters were applied to exclude sets with:

- fewer than 5 genes (C2:CGP only)

- fewer than 10 genes (all other collections)

- more than 2,000 genes

- 90% or higher similarity (overlap) to other set(s) within a collection
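
A minimal sketch of how such filters might be applied is shown below. The notes do not say how similarity is measured or which member of a similar pair is dropped, so the Jaccard index and the "keep the first set seen" rule are assumptions made purely for illustration; the function names are also hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets (an assumed stand-in for 'similarity')."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_gene_sets(gene_sets, collection, max_size=2000, max_similarity=0.9):
    """Return the gene sets that pass the size and similarity filters.

    gene_sets maps a set name to an iterable of human Entrez Gene IDs.
    """
    # Minimum size is 5 genes for C2:CGP and 10 for all other collections.
    min_size = 5 if collection == "C2:CGP" else 10
    kept = {}
    for name, genes in gene_sets.items():
        genes = set(genes)
        if not (min_size <= len(genes) <= max_size):
            continue  # too small or too large
        # Assumption: drop a set that is >= 90% similar to one already kept.
        if any(jaccard(genes, other) >= max_similarity for other in kept.values()):
            continue
        kept[name] = genes
    return kept


example = {
    "A": range(10),          # 10 genes, kept
    "B": range(10),          # identical to A, dropped as too similar
    "C": range(4),           # 4 genes, too small for any collection
    "D": range(100, 106),    # 6 genes, large enough only for C2:CGP
}
print(list(filter_gene_sets(example, "C2:CGP")))  # prints ['A', 'D']
```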

New collection C6: Oncogenic Pathway Activation Modules

C6: Oncogenic Pathway Activation Modules is a new collection of 189 gene sets. These gene sets represent expression signatures derived directly from microarray data from experiments involving gain or loss of function of several established cancer genes in well defined, "clean" experimental systems. In this context, gain of function stands for increased activity of a cancer gene by means of over-expression or treatment with a chemical modulator. Conversely, loss of function stands for diminished activity of the cancer gene by means of RNAi knockdown, gene knockout, or enzymatic inhibition.

New gene sets curated from papers

1,035 new gene sets curated from papers were added to the C2:CGP (Chemical and Genetic Perturbations) sub-collection. In a review of the whole collection, 12 existing sets were renamed and 29 sets were deprecated. Renamed and deprecated sets are listed here.

New canonical pathway gene sets

The CP (Canonical Pathways) sub-collection has two new sources of gene sets. (1) 132 gene sets were collected from the Munich Information Center for Protein Sequences (MIPS) CORUM database, which provides a resource of manually annotated protein complexes from mammalian organisms. The gene sets correspond to human protein complexes extracted from the CORUM database released on February 17, 2012. (2) 196 gene sets were collected from the Pathway Interaction Database (PID), which is a highly structured, curated collection of information about known biomolecular interactions and key cellular processes assembled into signaling pathways. PID was a collaborative project between the NCI and Nature Publishing Group (NPG) from 2006 until September 22nd, 2012, and is no longer being updated. In collaboration with the PID resource, we extracted the gene sets from the PID data file (uniprot.tab) downloaded on May 15, 2012.

The entire CP (Canonical Pathways): Reactome sub-collection was updated to version 44 of Reactome. Reactome is a curated knowledgebase of biological pathways in humans. This update created 399 new sets, leaving 275 sets from v3.0 MSigDB unchanged and 155 deprecated. No sets were renamed. Deprecated sets are listed here.

Updates to C4: Cancer Modules

The gene sets in C4: Cancer Modules represent clusters of transcriptionally co-regulated genes that both share a common functional annotation and have been found significantly deregulated in tumors. They correspond to the modules described in Segal et al., 2004. For the MSigDB v3.1 release, these gene sets were re-mapped to gene symbols from the Entrez Gene IDs as they appeared in original source files prior to v2.5. 23 sets were deprecated because they contained fewer than 10 genes. Names of all other sets were changed to upper case to match the naming convention throughout MSigDB. Renamed and deprecated sets are listed here.

Updates to gene families

We fixed a discrepancy between the family of transcription factors and homeodomain proteins. All homeodomain proteins are transcription factors. However, due to differences in sources and compilation procedures, some homeodomain proteins were not present in the transcription factors gene family. This has been fixed in the 3.1 release.

Viewing previous versions of MSigDB

The MSigDB v3.0 and v2.5 files are archived and are available on the Downloads page. You can view them through the MSigDB Browser tool in the GSEA desktop application. Please see GSEA 2.0.8 Release Notes for details.

@@ Line 1: / Line 1: @@
+[http://www.broadinstitute.org/gsea/ GSEA Home] |
+[http://www.broadinstitute.org/gsea/downloads.jsp Downloads] |
+[http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] |
+[http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |
+[http://www.broadinstitute.org/gsea/contact.jsp Contact] <br />
+<br />
 <h2>Improved mapping to common gene identifiers </h2>
 <p> MSigDB now uses <strong> human Entrez Gene IDs </strong> as the <strong>common gene identifiers </strong>for all gene sets. Gene sets come from a number of different sources and are originally specified using a variety of gene identifiers. The MSigDB gene sets are converted to a common set of gene identifiers so they can be used in GSEA and other tools. Previous releases of MSigDB used human gene symbols for this purpose. Researchers prefer working with gene symbols because they can easily recognize, remember and put them in the context of their work. Unfortunately, a gene usually has multiple different symbols. Conversely, the same symbol may refer to a number of different genes. Finally, gene symbols change frequently. To overcome these issues, we now use Entrez Gene IDs as the universal gene identifier for all MSigDB gene sets. Entrez Gene IDs identify genes uniquely and never change. For convenience, we continue to display gene sets as human gene symbols by default. However, the symbols are now unambiguously derived from the corresponding human Entrez Gene IDs. For non-human original members, we first convert them to the organism-specific Entrez Gene IDs and then seek their orthologous counterparts as human Entrez Gene IDs. Finally, we derive human gene symbols from the corresponding common gene identifiers for all standard uses. Note that all gene sets are also available as the original identifiers specified by the source and as Entrez Gene IDs.</p>
-<p>Human Entrez Gene IDs and the corresponding symbols for the MSigDB v3.1 gene sets are based on <tt>gene_info.gz</tt> and <tt>gene_history.gz</tt>, downloaded from the [http://www.ncbi.nlm.nih.gov/gene Entrez Gene] FTP site on November, 15, 2011.</p>
+<p>Human Entrez Gene IDs and the corresponding symbols for the MSigDB v3.1 gene sets are based on <tt>gene_info.gz</tt> and <tt>gene_history.gz</tt>, downloaded from the [http://www.ncbi.nlm.nih.gov/gene Entrez Gene] FTP site on November 15, 2011. Mouse -> Human and Rat -> Human 1:1 orthologous relationships are from [http://www.informatics.jax.org/ Mouse Genome Informatics (MGI)].</p>
-<p>Mouse -> Human and Rat -> Human 1:1 orthologous relationships are from [http://www.informatics.jax.org/ Mouse Genome Informatics (MGI)].</p>
 MSigDB gene sets are subject to <strong>size and similarity restrictions</strong>.
 After mapping to human Entrez Gene IDs, filters were applied to exclude sets with:
@@ Line 10: / Line 15: @@
 <ul>- more than 2,000 genes</ul>
 <ul>- 90% or higher similarity (overlap) to other set(s) within a collection</ul>
-<p>This page describes changes in Release 3.1 of the Molecular Signatures Database (MSigDB)</p>
 <h2>New collection C6: Oncogenic Pathway Activation Modules</h2>
@@ Line 16: / Line 20: @@
 <p><strong> C6: Oncogenic Pathway Activation Modules</strong> is a <strong>new collection</strong> of 189 gene sets. These gene sets represent expression signatures derived directly from microarray data from experiments involving gain or loss of function of several established cancer genes in well defined, "clean" experimental systems. In this context, gain of function stands for increased activity of a cancer gene by means of over-expression or treatment with a chemical modulator. Conversely, loss of function stands for diminished activity of the cancer gene by means of RNAi knockdown, gene knockout, or enzymatic inhibition.</p>
-<h2>Additions to C2: Curated Gene Sets </h2>
+<h2>New gene sets curated from papers </h2>
-<p>The C2: Curated Gene Sets collection consists of gene sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts. The MSigDB 3.1 release includes a number of new gene sets and other updates to this collection:<br>
+<strong>1,035 new gene sets </strong>curated from papers were added to the <strong>C2:CGP</strong> (Chemical and Genetic Perturbations) sub-collection. In a review of the whole collection, 12 existing sets were renamed and 29 sets were deprecated. Renamed and deprecated sets are listed [[Mapping_between_v3.1_and_v3.0_gene_sets|here]].
-<ul>
-    <li><strong>1,035 new gene sets </strong>curated from papers were added to the <strong>CGP</strong> (Chemical and Genetic Perturbations) sub-collection. In a review of the whole collection, 12 existing sets were renamed and 29 sets were deprecated.</li>
-   <li>The sets in the CP (Canonical Pathways): <strong> Reactome </strong> sub-collection were <strong>updated</strong> to version 44 of Reactome. [http://www.reactome.org/ Reactome] is a curated knowledgebase of biological pathways in humans. This update created 399 new sets, leaving 275 sets from v3.0 MSigDB unchanged and 155 deprecated. No sets were renamed.
+<h2>New canonical pathway gene sets </h2>
-        </li>
+<p> The CP (Canonical Pathways) sub-collection has two new sources of gene sets. (1) <strong> 132 gene sets </strong> were collected from the Munich Information Center for Protein Sequences <strong>(MIPS)</strong>
-         <li> CP (Canonical Pathways): <strong>MIPS</strong> is a <strong>new sub-collection</strong> of 132 sets collected from the Munich Information Center for Protein Sequences
   [http://mips.helmholtz-muenchen.de/genre/proj/corum CORUM] database, which provides a resource of manually annotated protein complexes from mammalian organisms. The gene sets correspond to human protein complexes extracted from the CORUM database released on February 17, 2012.
-        </li>
+(2) <strong> 196 gene sets </strong> were  collected from the
-         <li>CP (Canonical Pathways): <strong>PID</strong> is a <strong>new sub-collection</strong> of 196 sets collected from the
+[http://pid.nci.nih.gov/ Pathway Interaction Database] <strong>(PID)</strong>, which is a highly-structured, curated collection of information about known biomolecular interactions and key cellular processes assembled into signaling pathways. This was a collaborative project between the NCI and Nature Publishing Group (NPG) from 2006 until September 22nd, 2012, and is no longer being updated. In collaboration with the PID resource we extracted the gene sets from the PID data file (<tt>uniprot.tab</tt>) downloaded on May 15, 2012.
-[http://pid.nci.nih.gov/ Pathway Interaction Database (PID)], which is a highly-structured, curated collection of information about known biomolecular interactions and key cellular processes assembled into signaling pathways. This was a collaborative project between the NCI and Nature Publishing Group (NPG) from 2006 until September 22nd, 2012, and is no longer being updated. In collaboration with the PID resource we extracted the gene sets from the PID data file (<tt>uniprot.tab</tt>) downloaded on May 15, 2012.
+           </p>
-           </li>
+<p>The entire CP (Canonical Pathways): <strong> Reactome </strong> sub-collection was <strong>updated</strong> to version 44 of Reactome. [http://www.reactome.org/ Reactome] is a curated knowledgebase of biological pathways in humans. This update created 399 new sets, leaving 275 sets from v3.0 MSigDB unchanged and 155 deprecated. No sets were renamed. Deprecated sets are listed [[Mapping_between_v3.1_and_v3.0_gene_sets|here]].
-    </ul>
+        </p>
-</p>
 <h2>Updates to C4: Cancer Modules</h2>
 <p>The gene sets in C4: Cancer Modules represent clusters of transcriptionally co-regulated genes that both share a common functional annotation and have been found significantly deregulated in tumors. They correspond to the modules described in [http://www.ncbi.nlm.nih.gov/pubmed/15448693 Segal et al., 2004].
-For the MSigDB v3.1 release, these gene sets were re-mapped to gene symbols from the Entrez Gene IDs as they appeared in original source files prior to v2.5. Twenty three sets were deprecated because they contained fewer than 10 genes. Names of all sets were changed to upper case font to match the naming convention throughout MSigDB. Renamed and deprecated sets are listed [[Mapping_between_v3.1_and_v3.0_gene_sets|here]].</p>
+For the MSigDB v3.1 release, these gene sets were re-mapped to gene symbols from the Entrez Gene IDs as they appeared in original source files prior to v2.5. 23 sets were deprecated because they contained fewer than 10 genes. Names of all other sets were changed to upper case font to match the naming convention throughout MSigDB. Renamed and deprecated sets are listed [[Mapping_between_v3.1_and_v3.0_gene_sets|here]].</p>
-<h2>Gene families</h2>
+<h2>Updates to gene families</h2>
 We fixed a discrepancy between the family of transcription factors and homeodomain proteins. All homeodomain proteins are transcription factors. However, due to differences in sources and compilation procedures, some homeodomain proteins were not present in the transcription factors gene family.  This has been fixed in the 3.1 release.
 <h2>Viewing previous versions of MSigDB</h2>
 The MSigDB v3.0 and v2.5 files are archived and are available on the [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] page. You can view them through the MSigDB Browser tool in the GSEA desktop application. Please see [[GSEA_v2.08._Release_Notes|GSEA 2.0.8 Release Notes]] for details.
