MSigDB v7.0 Release Notes
This page describes the changes made to the gene set collections for Release 7.0 of the Molecular Signatures Database (MSigDB). This is a major release that includes substantial updates to gene set annotations and gene symbol mapping procedures, corrections to miscellaneous errors, and an overhaul of several collections/sub-collections.
- 1 Changes to MSigDB Gene Symbol Mapping Procedures
- 2 Global Change to MSigDB Gene Set Inclusion Criteria
- 3 Overhaul of externally sourced collections
- 3.1 Overhaul of positional gene sets collection (C1)
- 3.2 Overhaul of the Reactome gene sets collection (C2:CP:Reactome)
- 3.3 Other Revisions to the C2 collection
- 3.4 Overhaul of gene ontology collection (C5)
- 4 Appendix 1: UniGene Derived Gene Sets Removed from C2:CGP
Changes to MSigDB Gene Symbol Mapping Procedures
Beginning in MSigDB 7.0, identifiers for genes will be mapped to their HGNC approved Gene Symbol and NCBI Gene ID through annotations extracted from the then-current release of ENSEMBL's BioMart data service.
- Gene annotations supplied in the MSigDB 7.0 release are derived from ENSEMBL version 97, corresponding to Gencode release 31, and reflect the HGNC Gene Symbols as of the Gencode 31 freeze date of February 2019. This change mitigates a previous issue in which retired gene symbols and symbol aliases that did not reflect the current annotation of the human genome were retained in MSigDB as a result of outdated microarray and transcriptome annotations. That issue caused symbols to be excluded from some gene sets and GSEA runs for two reasons: the same gene could appear under different symbols in different gene sets because of differing source annotations, and the symbols in a user-supplied dataset could fail to match those included in MSigDB.
- A new Gene Symbol CHIP file for the GSEA "Collapse dataset" feature will be supplied to facilitate remapping data sets that use gene annotations predating the ENSEMBL release 97/Gencode release 31 namespace into the namespace used in MSigDB 7.0 for GSEA.
- New CHIP files have been provided to enable the use of data sets containing Mouse/Rat gene symbols directly through the GSEA "Collapse dataset" feature. These annotations are derived from ENSEMBL 97's Mouse and Rat databases, respectively. Mappings to orthologous Human genes were derived from the procedure described below.
- Previous symbols and aliases for each current gene were provided by their respective symbol authorities, e.g., HGNC for Human, MGI for Mouse, and RGD for Rat.
- Previous NCBI IDs for all genes were extracted from the NCBI gene_history file available from the NCBI FTP server.
- Several CHIP files annotating platforms that are not included in ENSEMBL's BioMart database have been deprecated.
- Annotations for all platforms represented in ENSEMBL's BioMart database have been updated to reflect the ENSEMBL version 97 annotations.
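As an illustration of how retired NCBI Gene IDs could be resolved from a gene_history file, here is a minimal sketch. It assumes the file is tab-delimited with the columns tax_id, GeneID, Discontinued_GeneID, Discontinued_Symbol and Discontinue_Date, and that a "-" in the GeneID column means the discontinued ID has no replacement. The function names and the sample rows are hypothetical, and this is not the MSigDB pipeline itself.

```python
import io

def build_id_remap(lines, tax_id="9606"):
    """Map each discontinued NCBI Gene ID to its replacement ID, if it has one."""
    remap = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        row_tax, current_id, old_id = fields[0], fields[1], fields[2]
        # "-" is assumed to mean the ID was withdrawn without a replacement
        if row_tax == tax_id and current_id != "-":
            remap[old_id] = current_id
    return remap

def resolve(gene_id, remap):
    """Follow replacement chains until reaching an ID that was not replaced."""
    seen = set()
    while gene_id in remap and gene_id not in seen:
        seen.add(gene_id)
        gene_id = remap[gene_id]
    return gene_id

# Toy rows for illustration only (made-up IDs, not real records)
sample = io.StringIO(
    "#tax_id\tGeneID\tDiscontinued_GeneID\tDiscontinued_Symbol\tDiscontinue_Date\n"
    "9606\t200\t100\tOLDSYM1\t20100101\n"
    "9606\t300\t200\tOLDSYM2\t20150101\n"
    "9606\t-\t400\tOLDSYM3\t20180101\n"
    "10090\t999\t100\tMOUSEOLD\t20120101\n"
)
remap = build_id_remap(sample)
print(resolve("100", remap))  # follows 100 -> 200 -> 300
```

Following the chain matters because a replacement ID can itself be retired later; the `seen` set guards against a cyclic history.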
Change to gene orthology mapping procedure for non-human gene sets
- Mouse and Rat genes were assigned to their corresponding Human orthologues using the gene orthologies provided in ENSEMBL BioMart for ENSEMBL version 97.
- As many Mouse and Rat genes correspond to multiple possible Human orthologues of varying fidelity, a ranking procedure was used to match each non-human gene to its best orthologue. Genes were ranked by their dS/dN score, their averaged reciprocal percent identity, their Human Gene-order conservation score, and their Human Whole-genome alignment coverage. These metrics identify likely best orthologues using a combination of gene coding sequence conservation, gene non-coding sequence conservation, and genomic architecture conservation.
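The notes do not state how the four metrics are combined, so the following is only a sketch of one plausible reading: a lexicographic ranking in the order the metrics are listed, with higher values treated as better for every metric. Both of those choices are assumptions, as are the `Candidate` type, its field names, and the made-up values.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    human_gene: str
    ds_dn: float              # dS/dN score
    mean_pct_identity: float  # average of the two reciprocal percent identities
    goc_score: float          # gene-order conservation score
    wga_coverage: float       # whole-genome alignment coverage

def best_orthologue(candidates):
    """Return the top candidate under a lexicographic, higher-is-better ranking."""
    return max(
        candidates,
        key=lambda c: (c.ds_dn, c.mean_pct_identity, c.goc_score, c.wga_coverage),
    )

# Hypothetical Human candidates for a single Mouse gene (made-up values)
candidates = [
    Candidate("HUMAN_A", 5.0, 80.0, 50.0, 90.0),
    Candidate("HUMAN_B", 5.0, 92.0, 75.0, 99.0),
    Candidate("HUMAN_C", 2.0, 95.0, 100.0, 100.0),
]
best = best_orthologue(candidates)
print(best.human_gene)
```

Here HUMAN_A and HUMAN_B tie on the first metric, so the second metric decides between them. A weighted score or a rank sum would be equally valid alternatives under a different reading of the procedure.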
Global Change to MSigDB Gene Set Inclusion Criteria
As of MSigDB 7.0, the minimum size threshold for inclusion of a gene set in an MSigDB collection has been reduced to 5 unique gene symbols. This global filter threshold was previously set at 10 unique symbols. This change primarily affects gene sets in the C5:GO and C2:CP:Reactome collections. This does not affect the default thresholds in the GSEA application.
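The effect of the lowered threshold can be sketched as a simple filter on the number of unique symbols per set. The function name and the toy gene sets below are hypothetical.

```python
def filter_by_size(gene_sets, min_size=5):
    """Keep gene sets that contain at least `min_size` unique gene symbols."""
    return {
        name: sorted(set(symbols))
        for name, symbols in gene_sets.items()
        if len(set(symbols)) >= min_size
    }

# Toy gene sets for illustration only
gene_sets = {
    "SET_A": ["G1", "G2", "G3", "G4", "G5"],   # 5 unique symbols
    "SET_B": ["G1", "G1", "G2", "G3", "G4"],   # 4 unique symbols after deduplication
    "SET_C": ["G%d" % i for i in range(12)],   # 12 unique symbols
}
kept_v7 = filter_by_size(gene_sets, min_size=5)     # threshold as of MSigDB 7.0
kept_prev = filter_by_size(gene_sets, min_size=10)  # previous threshold
print(sorted(kept_v7), sorted(kept_prev))
```

Counting unique symbols rather than list length is what keeps SET_B out: its duplicate entry does not count toward the threshold.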
Overhaul of externally sourced collections
Overhaul of positional gene sets collection (C1)
C1 has been rebuilt to reflect the primary assembly of the current release of the Human Genome as present in ENSEMBL 97 and Gencode 31 (GRCh38.p12). Gene annotations for this collection are derived from ENSEMBL 97 and reflect the gene architecture as represented on the primary assembly. This resulted in a small reduction in the number of gene sets (-27), as sets representing complete chromosome arms with few annotated genes were removed.
Overhaul of the Reactome gene sets collection (C2:CP:Reactome)
- Reactome: Reactome gene sets have been updated to reflect the state of the Reactome pathway architecture as of Reactome v69 (+825 gene sets).
To limit redundancy between gene sets within the Reactome sub-collection, we applied the filtering procedure, based on Jaccard's coefficients and distance from the hierarchy root, that was previously described for limiting redundancy within the C5:GO collections (see also below).
Other Revisions to the C2 collection
Deprecated Gene Sets from C2:CGP
- Gene sets derived from the Signal Transduction Knowledge Environment have been removed from MSigDB (-27 gene sets). The underlying data for this resource are no longer available in a form that would allow the collection to be reliably maintained. Previous releases of MSigDB contained gene sets derived from this resource that were founder sets for one or more gene sets in the MSigDB Hallmark collection. These deprecated C2 sets are included in MSigDB 7.0 as an ARCHIVED collection in order to preserve links to their pages from the hallmark gene set pages.
- Gene sets with gene annotations derived from UniGene cluster identifiers have been retired and are no longer present in MSigDB 7.0 (-139 gene sets). The UniGene database was retired by NCBI as of July 2019. This change affects only gene sets where UniGene cluster identifiers were present in the gene set's original ids annotation. The full list of affected gene sets is given in Appendix 1.
Revisions to the C2:CP:BioCarta collection
Pathways curated from BioCarta have been revised to reflect the final available versions of the Human BioCarta pathways. This resulted in an overall increase of 72 gene sets. Several gene sets were also renamed or removed as a result of this change, including:
- BIOCARTA_CHREBP2_PATHWAY was renamed to BIOCARTA_CHREBP_PATHWAY.
- BIOCARTA_FEEDER_PATHWAY was removed.
- BIOCARTA_KREB_PATHWAY was removed.
- BIOCARTA_NEUROTRANSMITTERS_PATHWAY was removed.
- BIOCARTA_PROTEASOME_PATHWAY was removed.
Additionally, missing genes from the BIOCARTA_STATHMIN_PATHWAY have been corrected.
Miscellaneous corrections to curated gene sets (C2:CGP)
- The names of the gene sets ERB2_UP.V1_UP/DN have been corrected to ERBB2_UP.V1_UP/DN to accurately reflect the gene symbol.
- The gene set LEI_MYB_TARGETS was annotated as originating from the HG-U133A microarray platform. The correct platform is: HG_U95Av2. This has been corrected.
- The gene sets OISHI_CHOLANGIOMA_STEM_CELL_LIKE_UP/DN were annotated as originating from the HuGene-1_0_st microarray platform. The correct platform is: Affymetrix U133 Plus 2.0. This has been corrected.
- 16 of the 21 gene sets derived from PubMed 18509334 (Mikkelsen TS, et al.) were incorrectly annotated as being derived from human data. The originating species was, in fact, Mus musculus. This has been corrected.
- The gene sets CHEMELLO_SOLEUS_VS_EDL_MYOFIBERS_UP/DN had been assigned an incorrect PubMed ID. The correct PMID: 21364935 has been assigned.
- The original data source annotation for the gene sets HAN_SATB1_TARGETS_UP/DN had been inadvertently switched. HAN_SATB1_TARGETS_UP now correctly refers to Supplementary Table 3-b, and HAN_SATB1_TARGETS_DN now correctly refers to Supplementary Table 3-c, of the original source publication.
- Four gene sets were incorrectly attributed to PubMed 17906691 (Mantovani G., et al.): MANTOVANI_NFKB_TARGETS_UP, MANTOVANI_NFKB_TARGETS_DN, MANTOVANI_VIRAL_GPCR_SIGNALING_DN, and MANTOVANI_VIRAL_GPCR_SIGNALING_UP. These gene sets have been reassigned to the correct PMID and author (PMID: 17934524, Martin D., et al.) and renamed to reflect this correction. See: MARTIN_NFKB_TARGETS_UP, MARTIN_NFKB_TARGETS_DN, MARTIN_VIRAL_GPCR_SIGNALING_DN, MARTIN_VIRAL_GPCR_SIGNALING_UP.
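For users who hold saved lists of C2 gene set names from earlier releases, the renames and removals listed above could be applied with a small lookup table. This is only a sketch: the table covers only the names mentioned in these notes, and `update_names` and the example input are hypothetical.

```python
# Renames listed in these release notes (old name -> new name)
RENAMED = {
    "BIOCARTA_CHREBP2_PATHWAY": "BIOCARTA_CHREBP_PATHWAY",
    "ERB2_UP.V1_UP": "ERBB2_UP.V1_UP",
    "ERB2_UP.V1_DN": "ERBB2_UP.V1_DN",
    "MANTOVANI_NFKB_TARGETS_UP": "MARTIN_NFKB_TARGETS_UP",
    "MANTOVANI_NFKB_TARGETS_DN": "MARTIN_NFKB_TARGETS_DN",
    "MANTOVANI_VIRAL_GPCR_SIGNALING_UP": "MARTIN_VIRAL_GPCR_SIGNALING_UP",
    "MANTOVANI_VIRAL_GPCR_SIGNALING_DN": "MARTIN_VIRAL_GPCR_SIGNALING_DN",
}

# BioCarta gene sets listed as removed in these release notes
REMOVED = {
    "BIOCARTA_FEEDER_PATHWAY",
    "BIOCARTA_KREB_PATHWAY",
    "BIOCARTA_NEUROTRANSMITTERS_PATHWAY",
    "BIOCARTA_PROTEASOME_PATHWAY",
}

def update_names(names):
    """Return (updated names, names dropped because the set was removed)."""
    updated, dropped = [], []
    for name in names:
        if name in REMOVED:
            dropped.append(name)
        else:
            updated.append(RENAMED.get(name, name))
    return updated, dropped

# SOME_OTHER_SET is a hypothetical name standing in for an unaffected gene set
updated, dropped = update_names(
    ["MANTOVANI_NFKB_TARGETS_UP", "BIOCARTA_KREB_PATHWAY", "SOME_OTHER_SET"]
)
print(updated, dropped)
```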
Overhaul of gene ontology collection (C5)
Gene sets in this collection are derived from the controlled vocabulary of the Gene Ontology (GO) project: The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. The gene sets are named by GO term and contain genes annotated by that term. We have replaced the entire collection with new gene sets using recent GO term annotations.
This collection is divided into three sub-collections:
- CC: GO Cellular component (+421 gene sets). Gene sets derived from the Cellular Component Ontology.
- MF: GO Molecular function (+744 gene sets). Gene sets derived from the Molecular Function Ontology.
- BP: GO Biological process (+2914 gene sets). Gene sets derived from the Biological Process Ontology.
Outline of the procedure:
All sets are based on associations of GO terms to human genes. Genes annotated with the same GO term make the corresponding GO term gene set.
The input files are:
- gene2go (downloaded on February 21, 2019). This file reports GO terms that have been associated with genes in NCBI Entrez Gene. It is generated by processing the gene_association file on the GO FTP site and comparing the DB_Object_ID to annotation in NCBI Entrez Gene, as also reported in gene_info.gz. It is a tab-delimited plain text file with one tax_id / gene_id / evidence_code per line.
- go-basic.obo (downloaded on February 21, 2019). This file contains the entire GO ontology.
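To make the gene2go input concrete, the sketch below groups NCBI Gene IDs by GO term for human records. It assumes the column order tax_id, GeneID, GO_ID, Evidence, Qualifier, GO_term, PubMed, Category and uses tax_id 9606 for human. The function name is hypothetical, and the rows are toy data with placeholder term IDs, not real annotations.

```python
from collections import defaultdict

def go_term_gene_sets(lines, tax_id="9606"):
    """Group NCBI Gene IDs by GO term for one taxon (9606 is human)."""
    sets = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        row_tax, gene_id, go_id = fields[0], fields[1], fields[2]
        if row_tax == tax_id:
            sets[go_id].add(gene_id)
    return dict(sets)

# Toy rows for illustration only (placeholder term IDs and gene IDs)
header = "#tax_id\tGeneID\tGO_ID\tEvidence\tQualifier\tGO_term\tPubMed\tCategory"
rows = [
    ("9606", "101", "TERM:1", "IDA", "-", "toy term one", "-", "Process"),
    ("9606", "102", "TERM:1", "IEA", "-", "toy term one", "-", "Process"),
    ("9606", "101", "TERM:2", "IDA", "-", "toy term two", "-", "Function"),
    ("10090", "201", "TERM:1", "IDA", "-", "toy term one", "-", "Process"),
]
lines = [header] + ["\t".join(r) for r in rows]
sets = go_term_gene_sets(lines)
print(sets)
```

The sketch keeps every human row regardless of evidence code or qualifier; whether the MSigDB procedure filters on those columns is not stated in these notes.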
This procedure has been modified from that described previously for MSigDB v5.2. First, for each GO term we obtained the corresponding human genes from the gene2go file. Next, we applied the path rule: gene products are associated with the most specific GO terms possible, and all parent terms up to the root automatically apply to the gene product. Thus, each parent GO term gene set includes all genes associated with its child GO terms. Then we removed sets with fewer than 5 or more than 2,000 Gene IDs. Finally, we resolved redundancies as follows. We computed Jaccard's coefficient for each pair of sets and marked a pair as highly similar if its coefficient was greater than 0.85. We then clustered highly similar sets into "chunks" using the hclust function from the R stats package according to their GO terms and applied two rounds of filtering to every "chunk". First, we kept the largest set in the "chunk" and discarded the smaller sets. This left "chunks" of highly similar sets of identical sizes, which we further pruned by preferentially keeping the more general set (i.e., the set closest to the root of the GO ontology tree).
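The steps above can be sketched on a toy ontology as follows. This is an illustration under stated assumptions, not the MSigDB implementation: it groups highly similar sets by connected components (a simple union-find) in place of R's hclust, it takes the depth of each term as a precomputed dictionary, and all term names, gene names, and function names are hypothetical.

```python
from itertools import combinations

def propagate(direct, parents):
    """Path rule: every gene on a term is also added to all of its ancestors."""
    full = {term: set(genes) for term, genes in direct.items()}

    def ancestors(term, seen):
        for parent in parents.get(term, ()):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    for term, genes in direct.items():
        for anc in ancestors(term, set()):
            full.setdefault(anc, set()).update(genes)
    return full

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def prune(sets, depth, threshold=0.85):
    """Keep one set per chunk: the largest, then the one closest to the root."""
    chunk = {term: term for term in sets}  # union-find parent pointers

    def find(term):
        while chunk[term] != term:
            chunk[term] = chunk[chunk[term]]
            term = chunk[term]
        return term

    for a, b in combinations(sets, 2):
        if jaccard(sets[a], sets[b]) > threshold:
            chunk[find(a)] = find(b)
    groups = {}
    for term in sets:
        groups.setdefault(find(term), []).append(term)
    return sorted(
        min(members, key=lambda t: (-len(sets[t]), depth[t]))
        for members in groups.values()
    )

# Toy ontology: T_ROOT -> T_A -> {T_B, T_C} and T_ROOT -> T_D -> T_E
parents = {"T_A": ["T_ROOT"], "T_B": ["T_A"], "T_C": ["T_A"],
           "T_D": ["T_ROOT"], "T_E": ["T_D"]}
depth = {"T_ROOT": 0, "T_A": 1, "T_D": 1, "T_B": 2, "T_C": 2, "T_E": 2}
direct = {
    "T_B": [f"g{i}" for i in range(1, 10)],   # 9 genes
    "T_C": ["g10"],                           # 1 gene
    "T_E": [f"g{i}" for i in range(20, 26)],  # 6 genes
}
full = propagate(direct, parents)
sized = {t: g for t, g in full.items() if 5 <= len(g) <= 2000}
kept = prune(sized, depth)
print(kept)
```

In this toy case T_C is dropped by the size filter, T_B is discarded in favour of the larger T_A (Jaccard 0.9), and T_E is discarded in favour of T_D, which has identical content but sits closer to the root.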
A previous version of the C5 collection contained 864 gene sets that were founder sets for one or more gene sets in the MSigDB Hallmark collection. These deprecated C5 sets are included in MSigDB 7.0 as an ARCHIVED collection in order to preserve links to their pages from the hallmark gene set pages.
Appendix 1: UniGene Derived Gene Sets Removed from C2:CGP
- NIELSEN_LIPOSARCOMA_DN
- NIELSEN_LIPOSARCOMA_UP
- NIELSEN_SYNOVIAL_SARCOMA_UP
- NING_CHRONIC_OBSTRUCTIVE_PULMONARY_DISEASE_DN