Difference between revisions of "MSigDB v3.0 Release Notes"
Line 31: | Line 31:
  Note that all the gene set names for C2 have changed. Many of the names used in v2.5 were confusing or wrong, so these have been clarified or corrected. For CGP, the new naming convention is that all gene set names begin with the surname of the first author of the source paper. For CP, the names now begin with the contributor organization.<br />
  <ul>
− <li><strong>CGP</strong>: chemical and genetic perturbations (2,392 gene sets). See <a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Msigdb_mapping_v2.5_to_v3">this page</a> for information about MSigDB 2.5 gene sets that have been renamed, retired, recombined, or replaced in the MSigDB 3.0 release. All these gene sets have been
+ <li><strong>CGP</strong>: chemical and genetic perturbations (2,392 gene sets). See <a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Msigdb_mapping_v2.5_to_v3">this page</a> for information about MSigDB 2.5 gene sets that have been renamed, retired, recombined, or replaced in the MSigDB 3.0 release. All these gene sets have been verified against the original sources. During the reviewing process, we have:
  <ul>
− <li>added exact source of the gene set
+ <li>renamed gene sets to follow consistent conventions throughout the whole collection</li>
+ <li>wrote new, enhanced, brief descriptions according to consistent conventions throughout the whole collection</li>
+ <li>validated and corrected if necessary every attribute for each existing gene set</li>
+ <li>added exact source of the gene set (e.g., Table 1)</li>
  <li>added GEO or ArrayExpress ID when available</li>
− <li>changed the brief description of the gene set; added links to human Entrez Gene entries and PubChem Compound entries as appropriate
+ <li>changed the brief description of the gene set; added links to human Entrez Gene entries and PubChem Compound entries as appropriate</li>
−
  <li>used the original gene identifiers as reported in the source paper (not all gene sets did this originally)<br />
− <
+ <li>resolved cases of redundant gene sets</li>
−
  </ul>
  In addition, we made an aggressive effort to identify new gene sets and add them to the database, using the same stringent set of criteria for reviewing these new additions. </li>
Line 51: | Line 52:
  There are 3 third-level subcollections for C2CP: Biocarta (217 gene sets), KEGG (186 gene sets), and Reactome (430 gene sets) </li>
  </ul>
+
  <h3>C3: Motif gene sets (-1)</h3>
  <ul>
Revision as of 17:43, 11 September 2010
Major changes in Release 3.0 of the Molecular Signatures Database (MSigDB) include the following:
- We have excluded sets with fewer than ten (C1, C2:CP, C3-C5) or fewer than five (C2:CGP) human gene symbols.
- We have extensively reviewed and expanded the C2 collection and added new gene sets as detailed below.
- We have updated and expanded gene families.
- MSigDB gene sets now support human Entrez Gene IDs.
- We have modified the MSigDB XML file format to support human Entrez Gene IDs.
- A bug in the Compute Overlaps algorithm has been fixed.
- We have archived the MSigDB v2.5 files, which remain available for download.
Gene Sets Update
The following describes the changes made to the gene set collections for MSigDB v3.0.
Size Filtering
All collections have been filtered according to size in the following ways:
- if a gene set was not in the C2:CGP subcollection, then it needed to have 10 or more human gene symbols associated with it to be included in the v3.0 release
- if a gene set was in the C2:CGP subcollection, then it needed to have 5 or more human gene symbols associated with it to be included in the v3.0 release
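The two size rules above can be expressed as a single predicate. The following is a minimal sketch, not the code used to build the release: the function name, the `"C2:CGP"` label string, and the choice to count distinct symbols are assumptions made for illustration.

```python
from typing import List

# Minimum sizes as stated in the release notes.
MIN_SIZE_CGP = 5       # C2:CGP gene sets
MIN_SIZE_DEFAULT = 10  # every other collection

def passes_size_filter(subcollection: str, gene_symbols: List[str]) -> bool:
    """Return True if a gene set is large enough to be kept in v3.0."""
    # Assumption: duplicate symbols are counted once.
    n_symbols = len(set(gene_symbols))
    threshold = MIN_SIZE_CGP if subcollection == "C2:CGP" else MIN_SIZE_DEFAULT
    return n_symbols >= threshold
```

For example, a C2:CGP set with five distinct symbols passes, while a C1 set with nine does not.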
C1: Positional gene sets (-61)
- 61 gene sets in the CM subcollection have been deprecated due to small size (fewer than 10 human gene symbols).
No other changes were made in the C1 gene sets. For a description of this collection, see the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse Collections</a> page.
C2: Curated gene sets (+1,380)
The C2 collection consists of gene sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts. Gene sets in this collection have been extensively revised and expanded.
Note that all the gene set names for C2 have changed. Many of the names used in v2.5 were confusing or wrong, so these have been clarified or corrected. For CGP, the new naming convention is that all gene set names begin with the surname of the first author of the source paper. For CP, the names now begin with the contributor organization.
- CGP: chemical and genetic perturbations (2,392 gene sets). See <a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Msigdb_mapping_v2.5_to_v3">this page</a> for information about MSigDB 2.5 gene sets that have been renamed, retired, recombined, or replaced in the MSigDB 3.0 release. All these gene sets have been verified against the original sources. During the reviewing process, we have:
- renamed gene sets to follow consistent conventions throughout the whole collection
- written new, enhanced, brief descriptions according to consistent conventions throughout the whole collection
- validated and, where necessary, corrected every attribute for each existing gene set
- added exact source of the gene set (e.g., Table 1)
- added GEO or ArrayExpress ID when available
- changed the brief description of the gene set; added links to human Entrez Gene entries and PubChem Compound entries as appropriate
- used the original gene identifiers as reported in the source paper (not all gene sets did this originally)
- resolved cases of redundant gene sets
- CP: canonical pathways (880 gene sets). We have replaced all gene sets in this collection with the most up-to-date sets from BioCarta, KEGG, and Reactome, deprecating redundant sets and consolidating the updated versions. We retrieved human pathways from the KEGG and BioCarta websites, and Reactome contributed their pathways in collaboration with MSigDB. All sets from KEGG and BioCarta have been replaced with newer versions. The Reactome sets were prepared from the v33 Reactome summer 2010 release. To decide which of several redundant sets to keep, we applied the following priority rules to these data:
- Source priority: KEGG > Reactome > BioCarta
- Size priority: keep the set with the smaller size
- Name length priority: keep the set with the shorter name
- External ID priority: keep the set with the smaller ID
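One way to read the four priorities above is as a tie-breaking sort key. The sketch below rests on several assumptions not stated in the notes: that the rules are applied in the order listed, that the redundant sets have already been grouped by the caller, and that external IDs compare as plain strings. The `GeneSet` type and function names are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

class GeneSet(NamedTuple):
    name: str
    source: str            # "KEGG", "Reactome", or "BioCarta"
    external_id: str
    genes: FrozenSet[str]

# Lower rank wins: KEGG > Reactome > BioCarta.
SOURCE_RANK = {"KEGG": 0, "REACTOME": 1, "BIOCARTA": 2}

def priority_key(gs: GeneSet) -> Tuple[int, int, int, str]:
    """Sort key: source, then smaller size, shorter name, smaller ID."""
    return (
        SOURCE_RANK[gs.source.upper()],
        len(gs.genes),
        len(gs.name),
        gs.external_id,  # assumption: IDs are compared lexicographically
    )

def pick_representative(redundant: List[GeneSet]) -> GeneSet:
    """Choose the one set to keep from a group of redundant sets."""
    return min(redundant, key=priority_key)
```

Encoding the rules as a tuple keeps each later rule strictly a tie-breaker for the ones before it, which is the assumed reading of "priority" here.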
C3: Motif gene sets (-1)
- Thanks to a sharp user, we fixed an error in the description of the gene set "V$NRF2_01".
- All uncategorized gene sets in this collection have been assigned to the TFT subcollection.
- One gene set in the MIR subcollection has been deprecated due to small size (fewer than 10 human gene symbols).
No other changes were made in the C3 gene sets. For a description of this collection, see the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse Collections</a> page.
C4: Computational gene sets (-2)
- Two gene sets in the CM subcollection have been deprecated due to small size (fewer than 10 human gene symbols).
No other changes were made in the C4 gene sets. For a description of this collection, see the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse Collections</a> page.
C5: Gene Ontology gene sets
- Names of 71 gene sets have been changed by removing pairs of consecutive underscore characters ('_').
No other changes were made in the C5 gene sets. For a description of this collection, see the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse Collections</a> page.
For more information
For complete descriptions of all collections or to download the updated gene sets, go to the <a href="http://www.broad.mit.edu/gsea/msigdb/collections.jsp">Browse Collections</a> page.
Other Updates
XML Format Changes
The XML format and tags have changed. See <a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/MSigDB_XML_description">this page</a> for more information.
Entrez IDs Now Supported
All gene sets now have Entrez IDs as well as human gene symbols, and alternate GMT files are included on the <a href="http://www.broadinstitute.org/gsea/downloads.jsp">Downloads</a> page. In addition, we have added a new CHIP file that maps Entrez IDs to human gene symbols. Therefore, data files analyzed in GSEA can now use Entrez IDs.
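As an illustration of working with the Entrez-ID GMT files, here is a minimal reader. It assumes the usual tab-delimited GMT layout (set name, description, then one member per column); the function name and the example record are made up for this sketch and are not taken from an actual MSigDB file.

```python
import io
from typing import Dict, List, TextIO

def read_gmt(handle: TextIO) -> Dict[str, List[str]]:
    """Parse GMT text into a mapping of gene set name -> member IDs."""
    gene_sets: Dict[str, List[str]] = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *members = fields
        # Drop empty trailing columns.
        gene_sets[name] = [m for m in members if m]
    return gene_sets

# Hypothetical one-line GMT record whose members are Entrez Gene IDs.
example = io.StringIO("EXAMPLE_SET\thttp://example.org\t7157\t672\t675\n")
sets = read_gmt(example)
```

The same reader works for the gene-symbol GMT files, since only the member column values differ.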
Compute Overlaps Error Corrected
A user-reported bug in the Compute Overlaps algorithm has been corrected, improving the accuracy of the reported P values.
MSigDB v2.5 Files
The MSigDB v2.5 files are archived and are still available for download on the <a href="http://www.broadinstitute.org/gsea/downloads.jsp">Downloads</a> page.