Difference between revisions of "MSigDB v3.1 Release Notes"

From GeneSetEnrichmentAnalysisWiki
(110 intermediate revisions by 2 users not shown)
Line 1: Line 1:
<p>This page describes changes in Release 3.1 of the Molecular Signatures Database (MSigDB)</p>
+
[http://www.broadinstitute.org/gsea/ GSEA Home] |
<h2>Gene Symbols Update</h2>
+
[http://www.broadinstitute.org/gsea/downloads.jsp Downloads] |
<p> Gene sets consist of a large variety of gene identifiers, called <strong>original members</strong> here. For gene sets to be used by GSEA and other querying tools, original members have to be converted to a common, universal kind of gene identifier. Previous releases of MSigDB used human gene symbols for this purpose. Researchers prefer working with gene symbols because they can easily recognize, remember and put them in the context of their work. Unfortunately, a gene usually has multiple different symbols. Conversely, the same symbol can often refer to a number of different genes. Finally, gene symbols change frequently. To overcome these issues, we chose Entrez Gene IDs as robust universal identifiers (called <strong>ezid members</strong> here). Entrez Gene IDs uniquely identify human genes and never change. For convenience, we continue displaying gene sets as made from human gene symbols. However, the symbols are now unambiguously derived from the corresponding human Entrez Gene IDs. For non-human original members, we first convert them to the organism-specific Entrez Gene IDs and then seek their orthologous counterparts as human Entrez Gene IDs. For this, we rely on a collection of Bioconductor Annotation packages and internal lookup tables.</p>
+
[http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] |
<p>We have updated gene symbols for all sets and families according to the <tt>gene_info.gz</tt> and <tt>gene_history.gz</tt> files downloaded from the [http://www.ncbi.nlm.nih.gov/gene Entrez Gene] FTP site on November 15, 2011.</p>
+
[http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |
 +
[http://www.broadinstitute.org/gsea/contact.jsp Contact] <br />
 +
<br />
 +
<h2>Improved mapping to common gene identifiers </h2>
  
<h2>Gene Sets Update</h2>
+
<p> MSigDB now uses <strong> human Entrez Gene IDs </strong> as the <strong>common gene identifiers </strong>for all gene sets. Gene sets come from a number of different sources and are originally specified using a variety of gene identifiers. The MSigDB gene sets are converted to a common set of gene identifiers so they can be used in GSEA and other tools. Previous releases of MSigDB used human gene symbols for this purpose. Researchers prefer working with gene symbols because they can easily recognize, remember and put them in the context of their work. Unfortunately, a gene usually has multiple different symbols. Conversely, the same symbol may refer to a number of different genes. Finally, gene symbols change frequently. To overcome these issues, we now use Entrez Gene IDs as the universal gene identifier for all MSigDB gene sets. Entrez Gene IDs identify genes uniquely and never change. For convenience, we continue to display gene sets as human gene symbols by default. However, the symbols are now unambiguously derived from the corresponding human Entrez Gene IDs. For non-human original members, we first convert them to the organism-specific Entrez Gene IDs and then seek their orthologous counterparts as human Entrez Gene IDs. Finally, we derive human gene symbols from the corresponding common gene identifiers for all standard uses. Note that all gene sets are also available as the original identifiers specified by the source and as Entrez Gene IDs.</p>
The following describes the changes made to the gene set collections for MSigDB v3.1. <br />
+
<p>Human Entrez Gene IDs and the corresponding symbols for the MSigDB v3.1 gene sets are based on <tt>gene_info.gz</tt> and <tt>gene_history.gz</tt>, downloaded from the [http://www.ncbi.nlm.nih.gov/gene Entrez Gene] FTP site on November 15, 2011. Mouse -> Human and Rat -> Human 1:1 orthologous relationships are from [http://www.informatics.jax.org/ Mouse Genome Informatics (MGI)].</p>
<h3>Size Filtering</h3>
+
MSigDB gene sets are subject to <strong>size and similarity restrictions</strong>.
<p>After mapping to human Entrez Gene IDs, sets with fewer than five (C2:CGP) or 10 (all remaining collections) genes were not included in the v3.0 release. </p>
+
After mapping to human Entrez Gene IDs, filters were applied to exclude sets with:
 +
<ul>- fewer than 5 genes (C2:CGP only)</ul>
 +
<ul>- fewer than 10 genes (all other collections)</ul>
 +
<ul>- more than 2,000 genes</ul>
 +
<ul>- 90% or higher similarity (overlap) to other set(s) within a collection</ul>
  
<h3><font face="Arial">C1: Positional gene sets</font></h3>
+
<h2>New collection C6: Oncogenic Pathway Activation Modules</h2>
<p>No changes were made in the C1 gene sets other than updating their gene symbols.</p>
 
<p>For a description of this collection, see the  [http://www.broad.mit.edu/gsea/msigdb/collections.jsp Browse  Collections] page.</p>
 
  
<h3>C2: Curated gene sets (+1,380)</h3>
+
<p><strong> C6: Oncogenic Pathway Activation Modules</strong> is a <strong>new collection</strong> of 189 gene sets. These gene sets represent expression signatures derived directly from microarray data from experiments involving gain or loss of function of several established cancer genes in well defined, "clean" experimental systems. In this context, gain of function stands for increased activity of a cancer gene by means of over-expression or treatment with a chemical modulator. Conversely, loss of function stands for diminished activity of the cancer gene by means of RNAi knockdown, gene knockout, or enzymatic inhibition.</p>
The C2 collection consists of gene sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts.&nbsp; Gene sets in this collection have been extensively revised and expanded.<br />
 
Note that all the gene set names for C2 have changed.&nbsp; Many of the names  used in v2.5 were confusing or wrong, so these have been clarified or  corrected.&nbsp; For CGP, the new naming convention is that all gene set  names begin with the surname of the first author of the source paper.&nbsp; For CP, the names now begin with the contributor organization.<br />
 
<ul>
 
    <li><strong>CGP</strong>: chemical and genetic perturbations (2,392 gene sets). See <a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Msigdb_mapping_v2.5_to_v3">this page</a>  for information about MSigDB 2.5 gene sets that have been renamed, retired, recombined, or replaced in the MSigDB 3.0 release.&nbsp; All these gene sets have been verified against the original sources.&nbsp; During the reviewing process, we have:
 
    <ul>
 
        <li>renamed gene sets to follow consistent conventions throughout the whole collection</li>
 
        <li>written new, enhanced, brief descriptions according to consistent conventions throughout the whole collection</li>
 
        <li>validated and corrected, if necessary, every attribute for each existing gene set</li>
 
        <li>added the exact source of the gene set (e.g., Table 1)</li>
 
        <li>added GEO or ArrayExpress ID when available</li>
 
        <li>added links to human Entrez Gene entries and PubChem Compound entries as appropriate</li>
 
        <li>used the original gene identifiers as reported in the source paper (not all gene sets did this originally)<br />
 
        </li>
 
        <li>resolved cases of redundant gene sets</li>
 
    </ul>
 
    In addition, we made an aggressive effort to identify new gene sets and add them to the database, using the same stringent set of criteria for reviewing these new additions.     </li>
 
    <li><strong>CP</strong>: canonical pathways (880 gene sets).
 
    <ul>
 
        <li>We have deprecated all gene sets:
 
        <ul>
 
            <li>from GenMAPP gene sets because the majority of them in the previous release are based on KEGG or GO information that we already have</li>
 
            <li>from GO in this collection because they are already represented by C5</li>
 
            <li>based on NetAffx annotations because these are largely based on GO and thus are already represented by C5</li>
 
            <li>with untraceable origins</li>
 
        </ul>
 
        </li>
 
        <li>We have replaced all existing BioCarta and KEGG gene sets with updated versions from these resources.</li>
 
        <li>In collaboration with Reactome, we added 430 new canonical pathway sets</li>
 
        <li>To reduce redundancy in canonical pathways from BioCarta, KEGG, and Reactome, we developed and applied the following filters:
 
        <ul>
 
            <li>Source priority: KEGG &gt; Reactome &gt; BioCarta</li>
 
            <li>Size priority: keep the set with the smaller size</li>
 
            <li>Name length priority: keep the set with the shorter name</li>
 
            <li>External ID priority: keep the set with the smaller ID (applied to Reactome sets only)  </li>
 
        </ul>
 
        </li>
 
        <li>For convenience, we have organized gene sets from BioCarta, KEGG, and Reactome as separate, third-level divisions within C2:CP</li>
 
    </ul>
 
    </li>
 
</ul>
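
The redundancy filters listed above for C2:CP can be read as a chain of tie-breakers. The following Python sketch is one possible reading, not the procedure actually used for MSigDB: it assumes the four priorities are applied in the listed order, that redundant sets have already been grouped together, and that external IDs compare as integers. The names `PathwaySet`, `priority_key` and `pick_representative` are hypothetical.

```python
from dataclasses import dataclass

# Lower rank wins: KEGG > Reactome > BioCarta (order taken from the list above).
SOURCE_RANK = {"KEGG": 0, "REACTOME": 1, "BIOCARTA": 2}


@dataclass(frozen=True)
class PathwaySet:
    name: str
    source: str            # one of the keys of SOURCE_RANK
    genes: frozenset
    external_id: int = 0   # assumption: numeric ID, only meaningful for Reactome


def priority_key(s):
    """Sort key: source, then smaller size, then shorter name, then smaller ID."""
    # The external ID tie-breaker applies to Reactome sets only.
    ext = s.external_id if s.source == "REACTOME" else 0
    return (SOURCE_RANK[s.source], len(s.genes), len(s.name), ext)


def pick_representative(redundant_sets):
    """Return the one set to keep from a group of mutually redundant sets."""
    return min(redundant_sets, key=priority_key)
```

Encoding the priorities as a tuple keeps the order of the rules explicit, so a different ordering only requires rearranging the tuple.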
 
<h3>C3: Motif gene sets (-1)</h3>
 
<p>No changes were made in the C3 gene sets other than updating gene symbols.</p>
 
<p>Gene sets in the C3 collection consist of genes sharing a cis-regulatory motif.  This collection contains the following two subcollections:</p>
 
<h4>Transcription factor targets (TFT)</h4>
 
<p>These sets share upstream cis-regulatory motifs which can function as potential transcription factor binding sites. We used two approaches to generate these gene sets.</p>
 
<p>We extracted 460 mammalian transcriptional regulatory motifs from v7.4 [http://www.gene-regulation.com/ TRANSFAC] database.  We then generated the motif gene sets consisting of the inferred target genes for each motif.  Every such set consists of human genes whose promoters (defined as regions -2kb to +2kb around transcription start site) contain at least one instance of the motif.  We named these sets by the corresponding TRANSFAC matrix identifiers, e.g., V$MIF1_01.  The set’s full description is the TRANSFAC entry for the matching matrix, in a format described [http://www.gene-regulation.com/pub/databases/transfac/doc/matrix1SM.html here].</p>
 
<p>Motif gene sets of ‘conserved instances’ consist of the inferred target genes for each motif <strong>m</strong> of 174 upstream motifs highly conserved among five mammalian species. The motifs are catalogued in [http://www.ncbi.nlm.nih.gov/pubmed/15735639 Xie, et al. (2005, Nature 434, 338–345)] and represent potential transcription factor binding sites.  Each motif gene set consists of all human genes whose promoters (defined as regions -2kb to +2kb around the transcription start site) contained at least one conserved instance of motif <strong>m</strong>. If the motif’s sequence matched a transcription factor binding site documented in the TRANSFAC database (see above), then we appended the name of the TRANSFAC binding matrix to the motif sequence in the gene set name, e.g.: MOTIFSEQ_FOO, where MOTIFSEQ is the sequence of motif <strong>m</strong> and FOO is the TRANSFAC matrix name (e.g., V$MIF1_01).  The set’s full description in this case is the TRANSFAC entry for the matching matrix.  If the motif’s sequence matched no transcription factor binding site from TRANSFAC v.7.4, then we named the set MOTIFSEQ_UNKNOWN, where MOTIFSEQ is the sequence of motif <strong>m</strong>.</p>
 
  
<h4>microRNA Targets (MIR)</h4>
+
<h2>New gene sets curated from papers </h2>
<p>These gene sets consist of the inferred target genes for each motif <strong>m</strong> of 221 3'-UTR motifs highly conserved among five mammalian species. The motifs are catalogued in [http://www.ncbi.nlm.nih.gov/pubmed/15735639 Xie, et al. (2005, Nature 434, 338–345)] and represent potential microRNA binding sites.  Each motif gene set consists of all genes whose 3’-UTR contained at least one conserved instance of motif <strong>m</strong>.</p>
+
<strong>1,035 new gene sets </strong>curated from papers were added to the <strong>C2:CGP</strong> (Chemical and Genetic Perturbations) sub-collection. In a review of the whole collection, 12 existing sets were renamed and 29 sets were deprecated. Renamed and deprecated sets are listed [[Mapping_between_v3.1_and_v3.0_gene_sets|here]].
  
<h3>C4: Computational gene sets (-23)</h3>
+
       
Original members of these sets were reverted to human Entrez Gene IDs as they appeared in original source files prior to v2.5 and the corresponding human gene symbols were derived thereafter. Twenty-three sets were deprecated because they contained fewer than 10 human Entrez Gene IDs. Names of all sets were changed to upper case to match the naming convention throughout MSigDB. For a description of this  collection, see the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse  Collections</a> page.<br />
+
<h2>New canonical pathway gene sets </h2>
<h4>Cancer gene neighborhoods (CGN)</h4>
+
<p> The CP (Canonical Pathways) sub-collection has two new sources of gene sets. (1) <strong> 132 gene sets </strong> were collected from the Munich Information Center for Protein Sequences <strong>(MIPS)</strong>
<p>No changes were made in the C4:CGN gene sets other than updating their gene symbols.</p>
+
[http://mips.helmholtz-muenchen.de/genre/proj/corum CORUM] database, which provides a resource of manually annotated protein complexes from mammalian organisms. The gene sets correspond to human protein complexes extracted from the CORUM database released on February 17, 2012.  
<p>Starting with a curated list of 380 cancer-associated genes ([http://www.ncbi.nlm.nih.gov/pubmed/14593198 Brentani et al. 2003, Proc. Natl. Acad. Sci. USA 100, 13418-13423]), Subramanian, Tamayo et al. ([http://www.ncbi.nlm.nih.gov/pubmed/16199517 2005 Proc. Natl. Acad. Sci. USA 102, 15545-15550]) mined four expression compendia for correlated gene sets. Gene neighborhoods with fewer than 25 genes at a Pearson correlation threshold of 0.8 were omitted, yielding 427 sets.</p>
+
(2) <strong> 196 gene sets </strong> were  collected from the
<ul>Gene set names indicate the corresponding expression compendia and the seed cancer-associated genes:
+
[http://pid.nci.nih.gov/ Pathway Interaction Database] <strong>(PID)</strong>, which is a highly structured, curated collection of information about known biomolecular interactions and key cellular processes assembled into signaling pathways. PID was a collaborative project between the NCI and Nature Publishing Group (NPG) from 2006 until September 22nd, 2012, and is no longer being updated. In collaboration with the PID resource, we extracted the gene sets from the PID data file (<tt>uniprot.tab</tt>) downloaded on May 15, 2012.
<li><strong>GNF2:</strong> Novartis normal human tissue gene expression compendium ([http://www.ncbi.nlm.nih.gov/pubmed/15075390 Su, et al. 2004 Proc. Natl. Acad. Sci. USA 101, 6062-6067])</li>
+
          </p>
<li><strong>CAR:</strong> Novartis carcinoma gene expression compendium ([http://www.ncbi.nlm.nih.gov/pubmed/11606367 Su, et al. 2001 Cancer Res. 61, 7388-7393])</li>
+
<p>Within the CP (Canonical Pathways) sub-collection, the entire <strong> Reactome </strong> sub-collection was <strong>updated</strong> to version 44 of Reactome. [http://www.reactome.org/ Reactome] is a curated knowledgebase of biological pathways in humans. This update created 399 new sets, leaving 275 sets from v3.0 MSigDB unchanged and 155 deprecated. No sets were renamed. Deprecated sets are listed [[Mapping_between_v3.1_and_v3.0_gene_sets|here]].
<li><strong>GCM:</strong> Global cancer map compendium ([http://www.ncbi.nlm.nih.gov/pubmed/11742071 Ramaswamy, et al. 2001 Proc. Natl. Acad. Sci. USA 98, 15149-15154])</li>
+
        </p>
<li><strong>MORF:</strong> A large internal compendium of gene expression data sets, including many in-house Affymetrix U95 cancer samples (1,693 in all) from a variety of cancer projects representing many different tissue types, mainly primary tumors, such as prostate, breast, lung, lymphoma, leukemia, etc. ([http://www.ncbi.nlm.nih.gov/pubmed/16199517 Subramanian, Tamayo et al. 2005 Proc. Natl. Acad. Sci. USA 102, 15545-15550])</li>
 
</ul>
 
  
<h3>C5: Gene Ontology gene sets </h3>
+
<h2>Updates to C4: Cancer Modules</h2>
<p>No changes were made in the C5 gene sets other than updating gene symbols. For a description of this collection, see the [http://www.broad.mit.edu/gsea/msigdb/collections.jsp Browse  Collections] page.</p>
+
<p>The gene sets in C4: Cancer Modules represent clusters of transcriptionally co-regulated genes that both share a common functional annotation and have been found significantly deregulated in tumors. They correspond to the modules described in [http://www.ncbi.nlm.nih.gov/pubmed/15448693 Segal et al., 2004].
 +
For the MSigDB v3.1 release, these gene sets were re-mapped to gene symbols from the Entrez Gene IDs as they appeared in original source files prior to v2.5. Twenty-three sets were deprecated because they contained fewer than 10 genes. Names of all other sets were changed to upper case to match the naming convention throughout MSigDB. Renamed and deprecated sets are listed [[Mapping_between_v3.1_and_v3.0_gene_sets|here]].</p>
  
<h3>For more information</h3>
+
<h2>Updates to gene families</h2>  
For complete descriptions of all collections or to download the updated gene sets, go to the [http://www.broad.mit.edu/gsea/msigdb/collections.jsp Browse  Collections] page.
+
We fixed a discrepancy between the family of transcription factors and homeodomain proteins. All homeodomain proteins are transcription factors. However, due to differences in sources and compilation procedures, some homeodomain proteins were not present in the transcription factors gene family. This has been fixed in the 3.1 release.
  
<h2>Other Updates</h2>
+
<h2>Viewing previous versions of MSigDB</h2>  
<h3>Organism annotation</h3>
+
The MSigDB v3.0 and v2.5 files are archived and are available on the [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] page. You can view them through the MSigDB Browser tool in the GSEA desktop application. Please see [[GSEA_v2.08._Release_Notes|GSEA 2.0.8 Release Notes]] for details.
We continue using scientific names to indicate source organism throughout MSigDB. Organism information corresponds to species annotation associated with original members.
 
 
 
<h3>Continued support of a variety of GMT files</h3>
 
<ul>
 
<li>human gene symbols: contain the word <strong>symbols</strong> in their names
 
<p>For standard GSEA analysis, no change is expected: just continue using these files as before. Starting with v3.1, all human gene symbols are derived from human Entrez Gene IDs. These files should serve for all standard analytical purposes, such as the default source of gene sets for GSEA.</p>
 
</li>
 
<li>original gene identifiers: contain the word <strong>orig</strong> in their names
 
<p>These files contain original members - identifiers reported exactly as they appear in the sources of gene sets. Because original identifiers are from a variety of platforms, we do not recommend using them for routine GSEA analysis. Instead, these files should serve as a reference and for uses other than standard GSEA.</p>
 
</li>
 
<li>human Entrez Gene IDs: contain the word <strong>entrez</strong> in their names
 
<p>While Entrez Gene IDs are more robust and reliable identifiers than gene symbols, they are much less convenient for standard purposes.</p>
 
</li>
 
</ul>
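
As a rough illustration of how these three kinds of files might be handled in a script, the Python sketch below reads a GMT file (one gene set per line: set name, description, then members, all tab-separated) and guesses the identifier kind from the file name. The function names and the example file name `c2.all.v3.1.symbols.gmt` are illustrative assumptions, not part of the release.

```python
def parse_gmt(lines):
    """Parse GMT lines into {set_name: [members]}.

    Each line is expected to hold a set name, a description, and one or
    more members, separated by tabs. Shorter lines are skipped.
    """
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        name, _description, *members = fields
        gene_sets[name] = [m for m in members if m]
    return gene_sets


def identifier_kind(filename):
    """Return 'symbols', 'orig' or 'entrez' based on the word in the file name."""
    for kind in ("symbols", "orig", "entrez"):
        if kind in filename:
            return kind
    return None


# Usage with in-memory lines; a real file object can be passed the same way.
example = ["SET_A\tsome description\tTP53\tBRCA1\n"]
sets = parse_gmt(example)
kind = identifier_kind("c2.all.v3.1.symbols.gmt")  # hypothetical file name
```

Checking the identifier kind before running an analysis is a cheap guard against accidentally feeding the <strong>orig</strong> files into a standard GSEA run.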
 
 
 
<h3>Compute Overlaps Error Corrected</h3>
 
Several users noted an error in calculations of p-values in the Compute Overlaps tool at the MSigDB web site. Thanks to these reports, we have corrected the error.<br />
 
 
 
<h3>MSigDB v3.0 and v2.5 Files</h3>
 
The MSigDB v3.0 and v2.5 files are archived and are available on the [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] page. Users can view them through the MSigDB Browser tool in the GSEA Java desktop application. Please consult [[GSEA_v2.08._Release_Notes|GSEA 2.0.8 Release Notes]] for details.
 

Latest revision as of 02:10, 25 September 2016


Improved mapping to common gene identifiers

MSigDB now uses human Entrez Gene IDs as the common gene identifiers for all gene sets. Gene sets come from a number of different sources and are originally specified using a variety of gene identifiers. The MSigDB gene sets are converted to a common set of gene identifiers so they can be used in GSEA and other tools. Previous releases of MSigDB used human gene symbols for this purpose. Researchers prefer working with gene symbols because they can easily recognize, remember and put them in the context of their work. Unfortunately, a gene usually has multiple different symbols. Conversely, the same symbol may refer to a number of different genes. Finally, gene symbols change frequently. To overcome these issues, we now use Entrez Gene IDs as the universal gene identifier for all MSigDB gene sets. Entrez Gene IDs identify genes uniquely and never change. For convenience, we continue to display gene sets as human gene symbols by default. However, the symbols are now unambiguously derived from the corresponding human Entrez Gene IDs. For non-human original members, we first convert them to the organism-specific Entrez Gene IDs and then seek their orthologous counterparts as human Entrez Gene IDs. Finally, we derive human gene symbols from the corresponding common gene identifiers for all standard uses. Note that all gene sets are also available as the original identifiers specified by the source and as Entrez Gene IDs.

Human Entrez Gene IDs and the corresponding symbols for the MSigDB v3.1 gene sets are based on gene_info.gz and gene_history.gz, downloaded from the Entrez Gene FTP site on November 15, 2011. Mouse -> Human and Rat -> Human 1:1 orthologous relationships are from Mouse Genome Informatics (MGI).
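
The symbol derivation described above can be sketched roughly as follows. This is an illustrative Python sketch, not the MSigDB pipeline. It assumes the tab-separated column order of the NCBI files (tax_id, GeneID, Symbol in gene_info; tax_id, GeneID, Discontinued_GeneID in gene_history, with "-" marking an ID retired without a replacement). All function names are hypothetical.

```python
HUMAN_TAX_ID = "9606"


def load_symbols(gene_info_lines):
    """Map each current human Entrez Gene ID to its symbol."""
    symbols = {}
    for line in gene_info_lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == HUMAN_TAX_ID:
            symbols[fields[1]] = fields[2]
    return symbols


def load_replacements(gene_history_lines):
    """Map each discontinued human ID to its replacement ID, or None if retired."""
    replaced = {}
    for line in gene_history_lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == HUMAN_TAX_ID:
            new_id, old_id = fields[1], fields[2]
            replaced[old_id] = None if new_id == "-" else new_id
    return replaced


def current_symbol(gene_id, symbols, replaced):
    """Follow the replacement chain until a current ID is found; None otherwise."""
    seen = set()
    while gene_id is not None and gene_id not in symbols:
        if gene_id in seen:
            return None  # guard against a cyclic history
        seen.add(gene_id)
        gene_id = replaced.get(gene_id)
    return symbols.get(gene_id) if gene_id is not None else None
```

Because the symbol is always looked up from the Entrez Gene ID, a renamed gene only requires refreshing the two input files; the gene set membership itself does not change.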

MSigDB gene sets are subject to size and similarity restrictions. After mapping to human Entrez Gene IDs, filters were applied to exclude sets with:

    - fewer than 5 genes (C2:CGP only)
    - fewer than 10 genes (all other collections)
    - more than 2,000 genes
    - 90% or higher similarity (overlap) to other set(s) within a collection
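
A minimal Python sketch of these filters is shown below. The release notes do not say how similarity is measured or which of two similar sets is retained, so the sketch makes two assumptions: similarity is the Jaccard index of the two gene sets, and the set encountered first is kept. The function names are hypothetical.

```python
def passes_size_filter(collection, genes, max_size=2000):
    """Apply the minimum (5 for C2:CGP, 10 otherwise) and maximum size limits."""
    min_size = 5 if collection == "C2:CGP" else 10
    return min_size <= len(genes) <= max_size


def jaccard(a, b):
    """Jaccard index of two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_collection(collection, gene_sets, threshold=0.9):
    """Return the sets of one collection that survive the size and similarity filters.

    gene_sets maps set name -> iterable of human Entrez Gene IDs.
    Sets are visited in insertion order; a set is dropped if it is at
    least `threshold` similar to a set that has already been kept.
    """
    kept = {}
    for name, genes in gene_sets.items():
        genes = set(genes)
        if not passes_size_filter(collection, genes):
            continue
        if any(jaccard(genes, other) >= threshold for other in kept.values()):
            continue
        kept[name] = genes
    return kept
```

The pairwise comparison is quadratic in the number of sets, which is acceptable for collections of a few thousand sets but would need indexing for much larger ones.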

New collection C6: Oncogenic Pathway Activation Modules

C6: Oncogenic Pathway Activation Modules is a new collection of 189 gene sets. These gene sets represent expression signatures derived directly from microarray data from experiments involving gain or loss of function of several established cancer genes in well defined, "clean" experimental systems. In this context, gain of function stands for increased activity of a cancer gene by means of over-expression or treatment with a chemical modulator. Conversely, loss of function stands for diminished activity of the cancer gene by means of RNAi knockdown, gene knockout, or enzymatic inhibition.

New gene sets curated from papers

1,035 new gene sets curated from papers were added to the C2:CGP (Chemical and Genetic Perturbations) sub-collection. In a review of the whole collection, 12 existing sets were renamed and 29 sets were deprecated. Renamed and deprecated sets are listed here.


New canonical pathway gene sets

The CP (Canonical Pathways) sub-collection has two new sources of gene sets. (1) 132 gene sets were collected from the Munich Information Center for Protein Sequences (MIPS) CORUM database, which provides a resource of manually annotated protein complexes from mammalian organisms. The gene sets correspond to human protein complexes extracted from the CORUM database released on February 17, 2012. (2) 196 gene sets were collected from the Pathway Interaction Database (PID), which is a highly structured, curated collection of information about known biomolecular interactions and key cellular processes assembled into signaling pathways. PID was a collaborative project between the NCI and Nature Publishing Group (NPG) from 2006 until September 22nd, 2012, and is no longer being updated. In collaboration with the PID resource, we extracted the gene sets from the PID data file (uniprot.tab) downloaded on May 15, 2012.

Within the CP (Canonical Pathways) sub-collection, the entire Reactome sub-collection was updated to version 44 of Reactome. Reactome is a curated knowledgebase of biological pathways in humans. This update created 399 new sets, leaving 275 sets from v3.0 MSigDB unchanged and 155 deprecated. No sets were renamed. Deprecated sets are listed here.

Updates to C4: Cancer Modules

The gene sets in C4: Cancer Modules represent clusters of transcriptionally co-regulated genes that both share a common functional annotation and have been found significantly deregulated in tumors. They correspond to the modules described in Segal et al., 2004. For the MSigDB v3.1 release, these gene sets were re-mapped to gene symbols from the Entrez Gene IDs as they appeared in original source files prior to v2.5. Twenty-three sets were deprecated because they contained fewer than 10 genes. Names of all other sets were changed to upper case to match the naming convention throughout MSigDB. Renamed and deprecated sets are listed here.

Updates to gene families

We fixed a discrepancy between the family of transcription factors and homeodomain proteins. All homeodomain proteins are transcription factors. However, due to differences in sources and compilation procedures, some homeodomain proteins were not present in the transcription factors gene family. This has been fixed in the 3.1 release.

Viewing previous versions of MSigDB

The MSigDB v3.0 and v2.5 files are archived and are available on the Downloads page. You can view them through the MSigDB Browser tool in the GSEA desktop application. Please see GSEA 2.0.8 Release Notes for details.