Difference between revisions of "MSigDB v3.0 Release Notes"

From GeneSetEnrichmentAnalysisWiki
 
(30 intermediate revisions by 2 users not shown)
Line 1: Line 1:
<a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/MSigDB_XML_description">GSEA Home</a> | <a href="http://www.broadinstitute.org/gsea/downloads.jsp">Downloads</a>  | <a href="http://www.broadinstitute.org/gsea/msigdb/">Molecular Signatures Database</a> | Documentation | <a href="http://www.broadinstitute.org/gsea/contact.jsp">Contact</a><br />
+
[http://www.broadinstitute.org/gsea/ GSEA Home] |
 +
[http://www.broadinstitute.org/gsea/downloads.jsp Downloads] |  
 +
[http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] |
 +
[http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |
 +
[http://www.broadinstitute.org/gsea/contact.jsp Contact]<br />
 
<br />
 
<br />
 
Major changes in Release 3.0 of the Molecular Signatures Database (MSigDB) include the following:<br />
 
Major changes in Release 3.0 of the Molecular Signatures Database (MSigDB) include the following:<br />
 
<ul>
 
<ul>
    <li>removed all sets smaller than ten (C1, C2:CP, C3-C5) or five (C2:CGP) human gene symbols</li>
+
    <li>removed all gene sets with fewer than ten (C1, C2:CP, C3-C5) or five (C2:CGP) human gene symbols</li>
    <li>C2 collection: extensively reviewed and added many new gene sets as detailed below</li>
+
    <li>C2 collection: extensively reviewed and added many new gene sets as detailed below</li>
    <li>gene families: updated and added new gene families</li>
+
    <li>gene families: updated and added new gene families</li>
 
     <li>MSigDB gene sets now support human Entrez Gene IDs</li>
 
     <li>MSigDB gene sets now support human Entrez Gene IDs</li>
 
     <li>enhanced features in the MSigDB XML file format</li>
 
     <li>enhanced features in the MSigDB XML file format</li>
 
     <li>fixed a bug in the Compute Overlaps algorithm </li>
 
     <li>fixed a bug in the Compute Overlaps algorithm </li>
 
     <li>archived MSigDB v2.5 files</li>
 
     <li>archived MSigDB v2.5 files</li>
 +
    <li>changes in gene set names are documented [[Msigdb_mapping_v2.5_to_v3|here]]</li>
 
</ul>
 
</ul>
 
<h2>Gene Sets Update</h2>
 
<h2>Gene Sets Update</h2>
 
The following describes the changes made to the gene set collections for MSigDB v3.0. <br />
 
The following describes the changes made to the gene set collections for MSigDB v3.0. <br />
 
<h3>Size Filtering</h3>
 
<h3>Size Filtering</h3>
All collections have been filtered according to size in the following ways:<br />
+
<p>After mapping to human Entrez Gene IDs, sets with fewer than five genes were not included in the v3.0 release.</p>
<ul>
+
 
    <li>if a gene set was not in the C2:CGP subcollection, then it needed to have 10 or more human gene symbols associated with it to be included in the v3.0 release</li>
 
    <li>if a gene set was in the C2:CGP subcollection, then it needed to have 5 or more human gene symbols associated with it to be included in the v3.0 release</li>
 
</ul>
 
<!--EndFragment-->
 
 
<h3><font face="Arial">C1: Positional gene sets (-60)</font></h3>
 
<h3><font face="Arial">C1: Positional gene sets (-60)</font></h3>
 
<ul>
 
<ul>
<li>60 gene sets have been deprecated due to small size (less than 10 human gene symbols).</li>
+
    <li>60 gene sets have been deprecated due to small size (fewer than five human Entrez Gene IDs).</li>
 
</ul>
 
</ul>
 
No other changes were made in the C1 gene sets. For a description of this collection, see the  <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse  Collections</a> page.&nbsp;
 
No other changes were made in the C1 gene sets. For a description of this collection, see the  <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse  Collections</a> page.&nbsp;
Line 28: Line 29:
 
<h3>C2: Curated gene sets (+1,380)</h3>
 
<h3>C2: Curated gene sets (+1,380)</h3>
 
The C2 collection consists of gene sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts.&nbsp; Gene sets in this collection have been extensively revised and expanded.<br />
 
The C2 collection consists of gene sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts.&nbsp; Gene sets in this collection have been extensively revised and expanded.<br />
<br />
 
 
Note that all the gene set names for C2 have changed.&nbsp; Many of the names  used in v2.5 were confusing or wrong, so these have been clarified or  corrected.&nbsp; For CGP, the new naming convention is that all gene set  names begin with the surname of the first author of the source paper.&nbsp; For CP, the names now begin with the contributor organization.<br />
 
Note that all the gene set names for C2 have changed.&nbsp; Many of the names  used in v2.5 were confusing or wrong, so these have been clarified or  corrected.&nbsp; For CGP, the new naming convention is that all gene set  names begin with the surname of the first author of the source paper.&nbsp; For CP, the names now begin with the contributor organization.<br />
 
<ul>
 
<ul>
 
     <li><strong>CGP</strong>: chemical and genetic perturbations (2,392 gene sets). See <a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Msigdb_mapping_v2.5_to_v3">this page</a>  for information about MSigDB 2.5 gene sets that have been renamed, retired, recombined, or replaced in the MSigDB 3.0 release.&nbsp; All these gene sets have been verified against the original sources.&nbsp; During the reviewing process, we have:
 
     <li><strong>CGP</strong>: chemical and genetic perturbations (2,392 gene sets). See <a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Msigdb_mapping_v2.5_to_v3">this page</a>  for information about MSigDB 2.5 gene sets that have been renamed, retired, recombined, or replaced in the MSigDB 3.0 release.&nbsp; All these gene sets have been verified against the original sources.&nbsp; During the reviewing process, we have:
 
     <ul>
 
     <ul>
        <li>renamed gene sets to follow consistent conventions throughout the whole collection</li>
+
        <li>renamed gene sets to follow consistent conventions throughout the whole collection</li>
        <li>wrote new, enhanced, brief descriptions according to consistent conventions throughout the whole collection</li>
+
        <li>written new, enhanced brief descriptions following consistent conventions throughout the whole collection</li>
         <li>validated and corrected if necessary every attribute for each existing gene set</li>
+
         <li>validated and corrected, if necessary, every attribute for each existing gene set</li>
         <li>added exact source of the gene set (e.g., Table 1)</li>
+
         <li>added the exact source of the gene set (e.g., Table 1)</li>
 
         <li>added GEO or ArrayExpress ID when available</li>
 
         <li>added GEO or ArrayExpress ID when available</li>
         <li>changed the brief description of the gene set; added links to human Entrez Gene entries and PubChem Compound entries as appropriate</li>
+
         <li>added links to human Entrez Gene entries and PubChem Compound entries as appropriate</li>
 
         <li>used the original gene identifiers as reported in the source paper (not all gene sets did this originally)<br />
 
         <li>used the original gene identifiers as reported in the source paper (not all gene sets did this originally)<br />
 +
        </li>
 
         <li>resolved cases of redundant gene sets</li>
 
         <li>resolved cases of redundant gene sets</li>
 
     </ul>
 
     </ul>
 
     In addition, we made an aggressive effort to identify new gene sets and add them to the database, using the same stringent set of criteria for reviewing these new additions.    </li>
 
     In addition, we made an aggressive effort to identify new gene sets and add them to the database, using the same stringent set of criteria for reviewing these new additions.    </li>
     <li><strong>CP</strong>: canonical pathways (880 gene sets).</li>
+
     <li><strong>CP</strong>: canonical pathways (880 gene sets).
    <ul>
+
    <ul>
          <li>We have deprecated all gene sets:</li>
+
        <li>We have deprecated all gene sets:
          <ul>
+
        <ul>
          <li>from GenMAPP gene sets because the majority of them in the previous release are based on KEGG or GO information that we already have</li>
+
            <li>from GenMAPP gene sets because the majority of them in the previous release are based on KEGG or GO information that we already have</li>
          <li>from GO in this collection because they are already represented by C5</li>
+
            <li>from GO in this collection because they are already represented by C5</li>
          <li>based on NetAffx annotations because these are largely based on GO and thus are already represented by C5</li>
+
            <li>based on NetAffx annotations because these are largely based on GO and thus are already represented by C5</li>
          <li>with untraceable origins</li>
+
            <li>with untraceable origins</li>
    </ul>
+
        </ul>
          <li>We have replaced all existing BioCarta and KEGG gene sets with updated versions from these resources.</li>
+
        </li>
          <li>In collaboration with Reactome, we add 430 new canonical pathway sets</li>
+
        <li>We have replaced all existing BioCarta and KEGG gene sets with updated versions from these resources.</li>
          <li>To reduce redundancy in canonical pathways from BioCarta, KEGG and Reactome, we developed and applied the following filters:
+
        <li>In collaboration with Reactome, we added 430 new canonical pathway sets</li>
              <ul>
+
        <li>To reduce redundancy in canonical pathways from BioCarta, KEGG, and Reactome, we developed and applied the following filters:
              <li>Source priority: KEGG &gt; Reactome &gt; BioCarta</li>
+
        <ul>
              <li>Size priority: keep the set with the smaller size</li>
+
            <li>Source priority: KEGG &gt; Reactome &gt; BioCarta</li>
              <li>Name length priority: keep the set with the shorter name</li>
+
            <li>Size priority: keep the set with the smaller size</li>
              <li>External ID priority: keep the set with the smaller ID (applied to Reactome sets only)  </li>
+
            <li>Name length priority: keep the set with the shorter name</li>
              </ul>
+
            <li>External ID priority: keep the set with the smaller ID (applied to Reactome sets only)  </li>
          </li>
+
        </ul>
    <li>For convenience, we have organized gene sets from BioCarta, KEGG and Reactome as separated, third-level divisions within C2 CP</li>
+
        </li>
    </ul>
+
        <li>For convenience, we have organized gene sets from BioCarta, KEGG, and Reactome as separate, third-level divisions within C2:CP</li>
 +
    </ul>
 +
    </li>
 
</ul>
 
</ul>
 
 
<h3>C3: Motif gene sets (-1)</h3>
 
<h3>C3: Motif gene sets (-1)</h3>
 
<ul>
 
<ul>
<li>Thanks to a sharp user, we fixed an error in the description of the gene set &quot;V$NRF2_01&quot;.</li>
+
    <li>Thanks to a sharp user, we fixed an error in the description of the gene set &quot;V$NRF2_01&quot;.</li>
<li>All uncategorized gene sets in this collection have been assigned to the TFT subcollection.</li>
+
    <li>All uncategorized gene sets in this collection have been assigned to the TFT subcollection.</li>
<li>One gene set in the MIR subcollection has been deprecated due to small size (less than 10 human gene symbols).</li>
+
    <li>One gene set in the MIR subcollection has been deprecated due to small size (fewer than five human Entrez Gene IDs).</li>
 +
    <li>No other changes were made in the C3 gene sets.</li>
 
</ul>
 
</ul>
No other changes were made in the C3 gene sets.&nbsp; For a description of this collection, see the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse  Collections</a> page.<br />
+
<p>Gene sets in the C3 collection consist of genes sharing a cis-regulatory motif.  This collection contains the following two subcollections:</p>
 +
<h4>Transcription factor targets (TFT)</h4>
 +
<p>These sets share upstream cis-regulatory motifs that can function as potential transcription factor binding sites. We used two approaches to generate these gene sets.</p>
 +
<p>We extracted 460 mammalian transcriptional regulatory motifs from the [http://www.gene-regulation.com/ TRANSFAC] database (v7.4).  We then generated motif gene sets consisting of the inferred target genes for each motif.  Every such set consists of human genes whose promoters (defined as the region from -2kb to +2kb around the transcription start site) contain at least one instance of the motif.  We named these sets by the corresponding TRANSFAC matrix identifiers, e.g., V$MIF1_01.  The set’s full description is the TRANSFAC entry for the matching matrix, in a format described [http://www.gene-regulation.com/pub/databases/transfac/doc/matrix1SM.html here].</p>
 +
<p>Motif gene sets of ‘conserved instances’ consist of the inferred target genes for each motif <strong>m</strong> of 174 upstream motifs highly conserved among five mammalian species. The motifs are catalogued in [http://www.ncbi.nlm.nih.gov/pubmed/15735639 Xie, et al. (2005, Nature 434, 338–345)] and represent potential transcription factor binding sites.  Each motif gene set consists of all human genes whose promoters (defined as the region from -2kb to +2kb around the transcription start site) contained at least one conserved instance of motif <strong>m</strong>.  If the motif’s sequence matched a transcription factor binding site documented in the TRANSFAC database (see above), then we appended the name of the TRANSFAC binding matrix to the motif sequence in the gene set name, e.g., MOTIFSEQ_FOO, where MOTIFSEQ is the sequence of motif <strong>m</strong> and FOO is the TRANSFAC matrix name (e.g., V$MIF1_01).  The set’s full description in this case is the TRANSFAC entry for the matching matrix.  If the motif’s sequence matched no transcription factor binding site from TRANSFAC v7.4, then we named the set MOTIFSEQ_UNKNOWN, where MOTIFSEQ is the sequence of motif <strong>m</strong>.</p>
 +
 
 +
<h4>microRNA Targets (MIR)</h4>
 +
<p>These gene sets consist of the inferred target genes for each motif <strong>m</strong> of 221 3'-UTR motifs highly conserved among five mammalian species.  The motifs are catalogued in [http://www.ncbi.nlm.nih.gov/pubmed/15735639 Xie, et al. 2005, Nature 434, 338–345] and represent potential microRNA binding sites. Each motif gene set consists of all genes whose 3'-UTR contained at least one conserved instance of motif <strong>m</strong>.</p>
  
 
<h3>C4: Computational gene sets (-2)</h3>
 
<h3>C4: Computational gene sets (-2)</h3>
 
<ul>
 
<ul>
<li>Two gene sets in the CM subcollection have been deprecated due to small size (less than 10 human gene symbols).</li>
+
    <li>Two gene sets in the CM subcollection have been deprecated due to small size (fewer than five human Entrez Gene IDs).</li>
 
</ul>
 
</ul>
 
No other changes were made in the C4 gene sets. For a description of this  collection, see the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse  Collections</a> page.<br />
 
No other changes were made in the C4 gene sets. For a description of this  collection, see the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse  Collections</a> page.<br />
 +
<h4>Cancer gene neighborhoods (CGN)</h4>
 +
<p>Starting with a curated list of 380 cancer-associated genes ([http://www.ncbi.nlm.nih.gov/pubmed/14593198 Brentani et al. 2003, Proc. Natl. Acad. Sci. USA 100, 13418-13423]), Subramanian, Tamayo et al. ([http://www.ncbi.nlm.nih.gov/pubmed/16199517 2005 Proc. Natl. Acad. Sci. USA 102, 15545-15550]) mined four expression compendia for correlated gene sets. Gene neighborhoods with fewer than 25 genes at a Pearson correlation threshold of 0.8 were omitted, yielding 427 sets. </p>
 +
<p>Gene set names indicate the corresponding expression compendia and the seed cancer-associated genes.</p>
 +
<ul>
 +
<li>GNF2: Novartis normal human tissue gene expression compendium ([http://www.ncbi.nlm.nih.gov/pubmed/15075390 Su, et al. 2004 Proc. Natl. Acad. Sci. USA 101, 6062-6067])</li>
 +
<li>CAR: Novartis carcinoma gene expression compendium ([http://www.ncbi.nlm.nih.gov/pubmed/11606367 Su, et al. 2001 Cancer Res. 61, 7388-7393])</li>
 +
<li>GCM: Global cancer map compendium ([http://www.ncbi.nlm.nih.gov/pubmed/11742071 Ramaswamy, et al. 2001 Proc. Natl. Acad. Sci. USA 98, 15149-15154])</li>
 +
<li>MORF: A large internal compendium of gene expression data sets, including many in-house Affymetrix U95 cancer samples (1,693 in all) from a variety of cancer projects representing many different tissue types, mainly primary tumors, such as prostate, breast, lung, lymphoma, leukemia, etc. ([http://www.ncbi.nlm.nih.gov/pubmed/16199517 Subramanian, Tamayo et al. 2005 Proc. Natl. Acad. Sci. USA 102, 15545-15550])</li>
 +
</ul>
  
 
<h3>C5: Gene Ontology gene sets </h3>
 
<h3>C5: Gene Ontology gene sets </h3>
 
<ul>
 
<ul>
<li>
+
    <li> Names of 71 gene sets have been changed by removing pairs of consecutive underscore characters ('_'). </li>
Names of 71 gene sets have been changed by removing pairs of consecutive underscore characters ('_').
 
</li>
 
 
</ul>
 
</ul>
No other changes were made in the C5 gene sets. For a description of this  collection, see the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse  Collections</a> page.<br />
+
No other changes were made in the C5 gene sets. For a description of this  collection, see the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse  Collections</a> page.<br />
 
 
 
<h3>For more information</h3>
 
<h3>For more information</h3>
 
For complete descriptions of all collections or to download the updated  gene sets, go to the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse  Collections</a> page.
 
For complete descriptions of all collections or to download the updated  gene sets, go to the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse  Collections</a> page.
  
 
<h2>Other Updates</h2>
 
<h2>Other Updates</h2>
<h3>XML Format Changes</h3>
+
<h3>Changes in organism annotation</h3>
The XML format and tags have changed.&nbsp; See <a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/MSigDB_XML_description">this page</a> for more information.<br />
+
We applied consistent organism names throughout MSigDB gene sets. For example, 'Human' was renamed to 'Homo sapiens'. Pathways reported as 'Human' which, according to the original records, are not present in 'Homo sapiens' were deprecated. We inspected the annotations of every gene set and replaced the 'Generic' organism name with the appropriate organism name. For gene sets in the C3 collection, we replaced the 'Human,Mouse,Rat,Dog' organism name with the more accurate 'Homo sapiens'. While the majority of the motifs themselves are conserved across the five species, human genome sequences were used to identify genes containing these and other motifs.
<h3>Entrez IDs Now Supported</h3>
+
<h3>Changes in the database XML file format</h3>
All gene sets now have Entrez IDs as well as human gene symbols, and alternate GMT files are included on the <a href="http://www.broadinstitute.org/gsea/downloads.jsp">Downloads</a>  page.&nbsp; In addition, we have added a new CHIP file that maps Entrez IDs to human gene symbols.&nbsp; Therefore, data files analyzed in GSEA can now use Entrez IDs.<br />
+
We have added new attributes to the XML database file.&nbsp; See <a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/MSigDB_XML_description">this page</a> for more information.<br />
 +
 
 +
<h3>More options with GMT files</h3>
 +
Traditionally, all gene sets in MSigDB use human gene symbols as standard gene identifiers. GMT files with gene sets based on human gene symbols contain the word <strong>symbols</strong> in their names. For standard GSEA analysis, no change is expected: just continue using <strong>*.symbols.gmt</strong> files as previously. For the first time, this release contains additional kinds of GMT files, with gene sets using:<br>
 +
<ul>
 +
<li>original gene identifiers: exactly as they appear in the sources of gene sets; these files contain the word <strong>orig</strong> in their names<br>
 +
      Since original identifiers are from a variety of platforms, we do not recommend using them for routine GSEA analysis. Instead, these files should serve for uses other than standard GSEA analysis.
 +
</li>
 +
<li>human Entrez Gene IDs; these files contain the word <strong>entrez</strong> in their names<br>
 +
        Because these files are not identical to the standard <strong>*.symbols.gmt</strong> files, GSEA done with these files might produce slightly different results compared to the standard <strong>*.symbols.gmt</strong> files.
 +
</li>
 +
</ul>
 +
 
 +
<p>
 +
For standard GSEA analysis, no change is expected: just continue using <strong>*.symbols.gmt</strong> files as previously. Alternatively, you can now choose GMT files with genes rendered as human Entrez Gene IDs.
 +
</p>
 +
 
 
<h3>Compute Overlaps Error Corrected</h3>
 
<h3>Compute Overlaps Error Corrected</h3>
A user-reported bug in the Compute Overlaps algorithm has been corrected, improving the quality of the <em>P</em> values.<br />
+
Several users noted an error in calculations of p-values in the Compute Overlaps tool at the MSigDB web site. Thanks to these reports, we have corrected the error.<br />
 +
 
 
<h3>MSigDB v2.5 Files</h3>
 
<h3>MSigDB v2.5 Files</h3>
The MSigDB v2.5 files are archived and are still available for download on the <a href="http://www.broadinstitute.org/gsea/downloads.jsp">Downloads</a>  page<br />
+
The MSigDB v2.5 files are archived and are still available for download on the <a href="http://www.broadinstitute.org/gsea/downloads.jsp">Downloads</a>  page.&nbsp; To load them into GSEA, see the [[GSEA_v2.0.7_Release_Notes|GSEA 2.0.7 Release Notes]].<br />
 
<p> </p>
 
<p> </p>

Latest revision as of 02:10, 25 September 2016

GSEA Home | Downloads | Molecular Signatures Database | Documentation | Contact

Major changes in Release 3.0 of the Molecular Signatures Database (MSigDB) include the following:

  • removed all gene sets with fewer than ten (C1, C2:CP, C3-C5) or five (C2:CGP) human gene symbols
  • C2 collection: extensively reviewed and added many new gene sets as detailed below
  • gene families: updated and added new gene families
  • MSigDB gene sets now support human Entrez Gene IDs
  • enhanced features in the MSigDB XML file format
  • fixed a bug in the Compute Overlaps algorithm
  • archived MSigDB v2.5 files
  • changes in gene set names are documented here

Gene Sets Update

The following describes the changes made to the gene set collections for MSigDB v3.0.

Size Filtering

After mapping to human Entrez Gene IDs, sets with fewer than five genes were not included in the v3.0 release.
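The size rule above can be illustrated with a short sketch. It assumes gene sets are held in a plain dictionary mapping a set name to a list of Entrez Gene IDs. The function name `filter_by_size`, the example data, and the choice to count distinct IDs are illustrative assumptions, not part of MSigDB.

```python
# Minimal sketch of the size filter, assuming gene sets are stored as
# {set_name: [entrez_id, ...]}. All names and data here are hypothetical.
MIN_GENES = 5  # threshold stated in the release notes


def filter_by_size(gene_sets, min_genes=MIN_GENES):
    """Keep only gene sets with at least `min_genes` distinct gene IDs."""
    return {
        name: genes
        for name, genes in gene_sets.items()
        if len(set(genes)) >= min_genes  # assumption: duplicates count once
    }


example = {
    "SET_A": ["1", "2", "3", "4", "5"],  # five IDs: kept
    "SET_B": ["1", "2", "3"],            # three IDs: dropped
}
kept = filter_by_size(example)
```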

C1: Positional gene sets (-60)

  • 60 gene sets have been deprecated due to small size (fewer than five human Entrez Gene IDs).

No other changes were made in the C1 gene sets. For a description of this collection, see the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse Collections</a> page. 

C2: Curated gene sets (+1,380)

The C2 collection consists of gene sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts.  Gene sets in this collection have been extensively revised and expanded.
Note that all the gene set names for C2 have changed.  Many of the names used in v2.5 were confusing or wrong, so these have been clarified or corrected.  For CGP, the new naming convention is that all gene set names begin with the surname of the first author of the source paper.  For CP, the names now begin with the contributor organization.

  • CGP: chemical and genetic perturbations (2,392 gene sets). See <a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Msigdb_mapping_v2.5_to_v3">this page</a> for information about MSigDB 2.5 gene sets that have been renamed, retired, recombined, or replaced in the MSigDB 3.0 release.  All these gene sets have been verified against the original sources.  During the reviewing process, we have:
    • renamed gene sets to follow consistent conventions throughout the whole collection
    • written new, enhanced brief descriptions following consistent conventions throughout the whole collection
    • validated and corrected, if necessary, every attribute for each existing gene set
    • added the exact source of the gene set (e.g., Table 1)
    • added GEO or ArrayExpress ID when available
    • added links to human Entrez Gene entries and PubChem Compound entries as appropriate
    • used the original gene identifiers as reported in the source paper (not all gene sets did this originally)
    • resolved cases of redundant gene sets
    In addition, we made an aggressive effort to identify new gene sets and add them to the database, using the same stringent set of criteria for reviewing these new additions.
  • CP: canonical pathways (880 gene sets).
    • We have deprecated all gene sets:
      • from GenMAPP gene sets because the majority of them in the previous release are based on KEGG or GO information that we already have
      • from GO in this collection because they are already represented by C5
      • based on NetAffx annotations because these are largely based on GO and thus are already represented by C5
      • with untraceable origins
    • We have replaced all existing BioCarta and KEGG gene sets with updated versions from these resources.
    • In collaboration with Reactome, we added 430 new canonical pathway sets
    • To reduce redundancy in canonical pathways from BioCarta, KEGG, and Reactome, we developed and applied the following filters:
      • Source priority: KEGG > Reactome > BioCarta
      • Size priority: keep the set with the smaller size
      • Name length priority: keep the set with the shorter name
      • External ID priority: keep the set with the smaller ID (applied to Reactome sets only)
    • For convenience, we have organized gene sets from BioCarta, KEGG, and Reactome as separate, third-level divisions within C2:CP
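
The notes list four redundancy filters but do not say how redundant sets were detected or how the filters interact. The sketch below assumes that a group of mutually redundant sets has already been identified and that the filters act as successive tie-breakers in the order listed. The `PathwaySet` class, its fields, and the function names are hypothetical.

```python
# Sketch of choosing one set from a group of redundant canonical pathways.
# Assumption: filters are applied as tie-breakers in the order listed above.
from dataclasses import dataclass
from typing import Optional

# Lower rank is preferred: KEGG > Reactome > BioCarta
SOURCE_RANK = {"KEGG": 0, "REACTOME": 1, "BIOCARTA": 2}


@dataclass(frozen=True)
class PathwaySet:
    name: str
    source: str                        # one of the SOURCE_RANK keys
    genes: frozenset
    external_id: Optional[int] = None  # only consulted for Reactome sets


def preference_key(s):
    """Sort key: source, then smaller size, shorter name, smaller ID."""
    use_id = s.source == "REACTOME" and s.external_id is not None
    ext = s.external_id if use_id else 0
    return (SOURCE_RANK[s.source], len(s.genes), len(s.name), ext)


def choose_representative(redundant_group):
    """Return the set to keep from a group of redundant sets."""
    return min(redundant_group, key=preference_key)
```

With this key, a KEGG set is always preferred over a redundant Reactome or BioCarta set, and the external ID only separates Reactome sets that tie on size and name length.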

C3: Motif gene sets (-1)

  • Thanks to a sharp user, we fixed an error in the description of the gene set "V$NRF2_01".
  • All uncategorized gene sets in this collection have been assigned to the TFT subcollection.
  • One gene set in the MIR subcollection has been deprecated due to small size (fewer than five human Entrez Gene IDs).
  • No other changes were made in the C3 gene sets.

Gene sets in the C3 collection consist of genes sharing a cis-regulatory motif. This collection contains the following two subcollections:

Transcription factor targets (TFT)

These sets share upstream cis-regulatory motifs that can function as potential transcription factor binding sites. We used two approaches to generate these gene sets.

We extracted 460 mammalian transcriptional regulatory motifs from the TRANSFAC database (v7.4). We then generated motif gene sets consisting of the inferred target genes for each motif. Every such set consists of human genes whose promoters (defined as the region from -2kb to +2kb around the transcription start site) contain at least one instance of the motif. We named these sets by the corresponding TRANSFAC matrix identifiers, e.g., V$MIF1_01. The set’s full description is the TRANSFAC entry for the matching matrix, in a format described here.

Motif gene sets of ‘conserved instances’ consist of the inferred target genes for each motif m of 174 upstream motifs highly conserved among five mammalian species. The motifs are catalogued in Xie, et al. (2005, Nature 434, 338–345) and represent potential transcription factor binding sites. Each motif gene set consists of all human genes whose promoters (defined as the region from -2kb to +2kb around the transcription start site) contained at least one conserved instance of motif m. If the motif’s sequence matched a transcription factor binding site documented in the TRANSFAC database (see above), then we appended the name of the TRANSFAC binding matrix to the motif sequence in the gene set name, e.g., MOTIFSEQ_FOO, where MOTIFSEQ is the sequence of motif m and FOO is the TRANSFAC matrix name (e.g., V$MIF1_01). The set’s full description in this case is the TRANSFAC entry for the matching matrix. If the motif’s sequence matched no transcription factor binding site from TRANSFAC v7.4, then we named the set MOTIFSEQ_UNKNOWN, where MOTIFSEQ is the sequence of motif m.
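
The naming rule for conserved-instance sets can be written as a small helper. This is only a sketch of the convention described above. The function name is hypothetical and the motif sequence in the example is made up for illustration.

```python
# Sketch of the MOTIFSEQ_FOO / MOTIFSEQ_UNKNOWN naming convention.
# The function name and the example motif sequence are hypothetical.
def conserved_motif_set_name(motif_seq, transfac_matrix=None):
    """Build a set name from a motif sequence and an optional TRANSFAC matrix name."""
    suffix = transfac_matrix if transfac_matrix else "UNKNOWN"
    return f"{motif_seq}_{suffix}"


matched = conserved_motif_set_name("ACGTACGT", "V$MIF1_01")
unmatched = conserved_motif_set_name("ACGTACGT")
```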

microRNA Targets (MIR)

These gene sets consist of the inferred target genes for each motif m of 221 3'-UTR motifs highly conserved among five mammalian species. The motifs are catalogued in Xie, et al. 2005, Nature 434, 338–345 and represent potential microRNA binding sites. Each motif gene set consists of all genes whose 3'-UTR contained at least one conserved instance of motif m.

C4: Computational gene sets (-2)

  • Two gene sets in the CM subcollection have been deprecated due to small size (fewer than five human Entrez Gene IDs).

No other changes were made in the C4 gene sets. For a description of this collection, see the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse Collections</a> page.

Cancer gene neighborhoods (CGN)

Starting with a curated list of 380 cancer-associated genes (Brentani et al. 2003, Proc. Natl. Acad. Sci. USA 100, 13418-13423), Subramanian, Tamayo et al. (2005 Proc. Natl. Acad. Sci. USA 102, 15545-15550) mined four expression compendia for correlated gene sets. Gene neighborhoods with fewer than 25 genes at a Pearson correlation threshold of 0.8 were omitted, yielding 427 sets.
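
The thresholding step can be sketched as follows. This is an illustration of the idea only, not the published procedure. It assumes expression profiles are stored as equal-length lists keyed by gene name, that the seed gene is excluded from its own neighborhood, and that the threshold is inclusive. All function names and data are hypothetical.

```python
# Sketch: build a correlation neighborhood around each seed gene and drop
# neighborhoods that are too small. Names, data layout, and the exclusion
# of the seed from its own neighborhood are assumptions.
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / math.sqrt(vx * vy)


def neighborhood(seed, expression, threshold=0.8):
    """Genes whose profile correlates with the seed at or above the threshold."""
    seed_profile = expression[seed]
    return sorted(
        gene
        for gene, profile in expression.items()
        if gene != seed and pearson(seed_profile, profile) >= threshold
    )


def build_neighborhoods(seeds, expression, threshold=0.8, min_size=25):
    """Keep only neighborhoods with at least `min_size` genes."""
    result = {}
    for seed in seeds:
        genes = neighborhood(seed, expression, threshold)
        if len(genes) >= min_size:
            result[seed] = genes
    return result
```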

Gene set names indicate the corresponding expression compendia and the seed cancer-associated genes.

  • GNF2: Novartis normal human tissue gene expression compendium (Su, et al. 2004 Proc. Natl. Acad. Sci. USA 101, 6062-6067)
  • CAR: Novartis carcinoma gene expression compendium (Su, et al. 2001 Cancer Res. 61, 7388-7393)
  • GCM: Global cancer map compendium (Ramaswamy, et al. 2001 Proc. Natl. Acad. Sci. USA 98, 15149-15154)
  • MORF: A large internal compendium of gene expression data sets, including many in-house Affymetrix U95 cancer samples (1,693 in all) from a variety of cancer projects representing many different tissue types, mainly primary tumors, such as prostate, breast, lung, lymphoma, leukemia, etc. (Subramanian, Tamayo et al. 2005 Proc. Natl. Acad. Sci. USA 102, 15545-15550)

    C5: Gene Ontology gene sets

    • Names of 71 gene sets have been changed by removing pairs of consecutive underscore characters ('_').

    No other changes were made in the C5 gene sets. For a description of this collection, see the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse Collections</a> page.

    For more information

    For complete descriptions of all collections or to download the updated gene sets, go to the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse Collections</a> page.

    Other Updates

    Changes in organism annotation

    We applied consistent organism names throughout MSigDB gene sets. For example, 'Human' was renamed to 'Homo sapiens'. Pathways reported as 'Human' which, according to the original records, are not present in 'Homo sapiens' were deprecated. We inspected the annotations of every gene set and replaced the 'Generic' organism name with the appropriate organism name. For gene sets in the C3 collection, we replaced the 'Human,Mouse,Rat,Dog' organism name with the more accurate 'Homo sapiens'. While the majority of the motifs themselves are conserved across the five species, human genome sequences were used to identify genes containing these and other motifs.

    Changes in the database XML file format

    We have added new attributes to the XML database file.  See <a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/MSigDB_XML_description">this page</a> for more information.

    More options with GMT files

    Traditionally, all gene sets in MSigDB use human gene symbols as standard gene identifiers. GMT files with gene sets based on human gene symbols contain the word symbols in their names. For standard GSEA analysis, no change is expected: just continue using *.symbols.gmt files as previously. For the first time, this release contains additional kinds of GMT files, with gene sets using:

    • original gene identifiers: exactly as they appear in the sources of gene sets; these files contain the word orig in their names
      Since original identifiers are from a variety of platforms, we do not recommend using them for routine GSEA analysis. Instead, these files should serve for uses other than standard GSEA analysis.
    • human Entrez Gene IDs; these files contain the word entrez in their names
      Because these files are not identical to the standard *.symbols.gmt files, GSEA done with these files might produce slightly different results compared to the standard *.symbols.gmt files.

    For standard GSEA analysis, no change is expected: just continue using *.symbols.gmt files as previously. Alternatively, you can now choose GMT files with genes rendered as human Entrez Gene IDs.
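
For readers who process these files outside GSEA, the sketch below reads a GMT file and infers the identifier type from the file name. It assumes the usual GMT layout of one gene set per line, with tab-separated fields for the set name, a description, and then the genes. The helper names and the example file name are hypothetical.

```python
# Sketch of reading a GMT file and detecting which identifier type it uses.
# Assumes tab-separated lines: name, description, gene1, gene2, ...
import io


def parse_gmt(handle):
    """Return {set_name: [gene, ...]} from an open GMT text stream."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


def identifier_kind(filename):
    """Guess the identifier type from the word embedded in the file name."""
    for kind in ("symbols", "entrez", "orig"):
        if f".{kind}." in filename:
            return kind
    return "unknown"


# In-memory example; a real file would be opened with open(path).
sample = io.StringIO("SET_A\texample description\tTP53\tBRCA1\n")
sets = parse_gmt(sample)
kind = identifier_kind("example.entrez.gmt")  # hypothetical file name
```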

    Compute Overlaps Error Corrected

    Several users noted an error in calculations of p-values in the Compute Overlaps tool at the MSigDB web site. Thanks to these reports, we have corrected the error.

    MSigDB v2.5 Files

    The MSigDB v2.5 files are archived and are still available for download on the <a href="http://www.broadinstitute.org/gsea/downloads.jsp">Downloads</a> page.  To load them into GSEA, see the GSEA 2.0.7 Release Notes.