Difference between revisions of "MSigDB v7.0 Release Notes"

Latest revision as of 16:47, 30 March 2020

This page describes the changes made to the gene set collections for Release 7.0 of the Molecular Signatures Database (MSigDB). This is a major release that includes substantial updates to gene set annotations, gene symbol mapping procedures, overhaul of several collections/sub-collections, and corrections to miscellaneous errors.

Note: Due to substantial changes in MSigDB, it is recommended that users migrate to GSEA 4.0.0+ when utilizing MSigDB 7.0+ resources.
Advisory: It is strongly recommended that users of MSigDB7/GSEA4.0 always use the GSEA "Collapse dataset to gene symbols" feature with the provided Symbol Remapping chip file if your dataset was generated with a transcriptome other than Ensembl 97/GENCODE 31.

Changes to MSigDB Gene Symbol Mapping Procedures

Now using Ensembl as the platform annotation authority

Beginning in MSigDB 7.0, identifiers for genes are mapped to their HGNC approved Gene Symbol and NCBI Gene ID through annotations extracted from Ensembl's BioMart data service, and will be updated at each MSigDB release with the latest available version of Ensembl. This change mitigates a previous issue where retired gene symbols and symbol aliases that did not reflect the current annotation of the human genome were retained in MSigDB as a result of outdated microarray and transcriptome annotations. That issue caused symbols to be excluded from some gene sets and GSEA analyses for two reasons: differing source annotations could leave multiple symbols for the same gene in different gene sets, and the symbols in a user-supplied dataset could fail to match those included in MSigDB.

Gene annotations supplied in the MSigDB 7.0 release are derived from Ensembl version 97 corresponding to GENCODE release 31 and reflect the HGNC Gene Symbols as of the GENCODE 31 freeze date of February 2019.

Change to gene orthology mapping procedure for non-human gene sets

Mouse and Rat genes were assigned to their corresponding Human orthologues using the gene orthologies provided in Ensembl BioMart for Ensembl version 97.
Because many Mouse and Rat genes have several possible Human orthologues of varying fidelity, a ranking procedure was used to match each non-human gene to its best orthologue. Genes were ranked by their dS/dN score, their averaged reciprocal percent identity, their Human Gene-order conservation score, and their Human Whole-genome alignment coverage. These metrics identify likely best orthologues using a combination of gene coding sequence conservation, gene non-coding sequence conservation, and genomic architecture conservation.
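
The ranking described above can be sketched roughly as follows. This is an illustration only, not the MSigDB pipeline: the release notes name the four metrics but do not say how they are combined, so the field names, the "higher is better" direction for every metric, and the lexicographic tie-breaking order are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate Human orthologue for a Mouse or Rat gene (hypothetical fields)."""
    human_gene: str
    ds_dn: float          # dS/dN score
    mean_identity: float  # average of the two reciprocal percent identities
    goc_score: float      # Human Gene-order conservation score
    wga_coverage: float   # Human Whole-genome alignment coverage


def best_orthologue(candidates):
    """Return the top-ranked candidate, or None when the list is empty.

    Assumption: metrics are compared in the order they are listed in the
    text, and a larger value is treated as better for each of them.
    """
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda c: (c.ds_dn, c.mean_identity, c.goc_score, c.wga_coverage),
    )
```

Candidates with missing metric values would need explicit handling before ranking; the sketch assumes every field is populated.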

CHIP file updates

A new Gene Symbol CHIP file for the GSEA "Collapse dataset" feature will be supplied in order to facilitate remapping data sets which use gene annotations prior to the Ensembl release 97/GENCODE release 31 namespace used in MSigDB 7.0 into that namespace for GSEA.
New CHIP files have been provided to enable the use of data sets containing Mouse/Rat gene symbols directly through the use of the GSEA "Collapse dataset" feature. These annotations are derived from Ensembl 97's Mouse and Rat databases respectively, and support experiments from pipelines relying on GENCODE annotations up to GENCODE release M22 (Mouse). Mappings to orthologous Human genes were derived by the procedure described above.
Previous symbols and aliases for each current gene were provided by their respective symbol authorities (e.g. HGNC for Human, MGI for Mouse, and RGD for Rat).
Previous NCBI IDs for all genes were extracted from the NCBI gene_history file available from the NCBI FTP server.
Several CHIP files annotating platforms which are not included in Ensembl's BioMart database have been deprecated.
Annotations for all platforms represented in Ensembl's BioMart database have been updated to reflect the Ensembl version 97 annotations.

Changes to data set handling recommendations

Due to substantial changes in MSigDB, it is recommended that users migrate to GSEA 4.0.0+ when utilizing MSigDB 7.0+ resources.
It is strongly recommended that users of MSigDB7/GSEA4.0 always use the GSEA "Collapse dataset to gene symbols" feature with the following Symbol Remapping chip files, even if your experiment is already in the gene symbols namespace, as this will ensure that your gene symbols are matched to those used in MSigDB7:
- Human_Symbol_with_Remapping_MSigDB.v7.0.chip
- Mouse_Gene_Symbol_Remapping_MSigDB.v7.0.chip
- Rat_Gene_Symbol_Remapping_MSigDB.v7.0.chip
This remapping is not necessary if your data set was generated using Ensembl 97 or GENCODE 31 transcriptomes. This is a change from our previous recommendation.
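
To make the effect of the remapping concrete, here is a minimal sketch of what collapsing a dataset through a symbol-remapping chip amounts to. It is not the GSEA implementation. It assumes the chip is a tab-delimited file with "Probe Set ID" and "Gene Symbol" columns, and it uses the per-sample maximum as one plausible collapse rule; the function names are hypothetical.

```python
import csv


def load_chip(lines):
    """Map each source identifier to its current gene symbol.

    `lines` is any iterable of text lines (an open file works).
    Assumed columns: "Probe Set ID" and "Gene Symbol".
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {
        row["Probe Set ID"]: row["Gene Symbol"]
        for row in reader
        if row["Gene Symbol"]
    }


def collapse(expression, chip):
    """Collapse {identifier: [values]} to {symbol: [values]}.

    When several identifiers map to the same symbol, the per-sample
    maximum is kept (an assumed rule). Identifiers that are absent
    from the chip are dropped.
    """
    collapsed = {}
    for identifier, values in expression.items():
        symbol = chip.get(identifier)
        if symbol is None:
            continue
        if symbol in collapsed:
            collapsed[symbol] = [max(a, b) for a, b in zip(collapsed[symbol], values)]
        else:
            collapsed[symbol] = list(values)
    return collapsed
```

In practice `lines` would be an open handle on one of the chip files listed above, and `expression` would hold the rows of the user's dataset keyed by the symbols it was generated with.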

Global Change to MSigDB Gene Set Inclusion Criteria

As of MSigDB 7.0 the minimum size threshold for inclusion of a gene set in an MSigDB collection has been reduced to 5 unique gene symbols. This global filter threshold was previously set at 10 unique symbols. This change primarily affects gene sets in the C5:GO and C2:CP:Reactome collections. This does not affect the default thresholds in the GSEA application.
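
As a small illustration of the inclusion criterion, the check below counts unique symbols rather than list entries. The function and constant names are hypothetical.

```python
MIN_SET_SIZE = 5  # MSigDB 7.0 threshold; the previous threshold was 10


def passes_inclusion_filter(symbols, min_size=MIN_SET_SIZE):
    """Return True when the gene set has at least `min_size` unique symbols."""
    return len(set(symbols)) >= min_size
```

A set listing the same symbol twice therefore counts that symbol only once toward the threshold.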

Updates to Gene Sets by Collection

C1 (positional gene sets) - Major overhaul

C1 has been rebuilt to reflect the primary assembly of the current release of the Human Genome as present in Ensembl 97 and GENCODE 31 (GRCh38.p12). Gene annotations for this collection are derived from the Chromosome and Karyotype band tracks from the Ensembl BioMart (version 97) and reflect the gene architecture as represented on the primary assembly. This resulted in a small reduction in the number of gene sets (-27), as sets representing complete chromosome arms with few annotated genes were removed.

C2:CP:Reactome - Major overhaul

Reactome gene sets have been updated to reflect the state of the Reactome pathway architecture as of Reactome v69 (+825 gene sets).
In order to limit redundancy between gene sets within the Reactome sub-collection we applied a filtering procedure based on Jaccard coefficients and distance from the top level of the Reactome event hierarchy. This is similar to the procedure applied in the C5 (Gene Ontology) collection (see below). Briefly, we computed Jaccard coefficients for each pair of sets, and marked a pair as highly similar if its Jaccard coefficient was greater than 0.85. We then clustered highly similar sets into "chunks" using the hclust function from the R stats package and applied two rounds of filtering for every "chunk". First, we kept the largest set in the "chunk" and discarded the smaller sets. This left "chunks" of highly similar sets of identical sizes, which we further pruned by preferentially keeping the more general set (i.e., the set closest to one of the 28 pathways at the top level of the Reactome Event Hierarchy).
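
The redundancy filter described above can be sketched as follows. This is a simplified stand-in, not the procedure actually run: instead of R's hclust it groups sets into connected components of the "Jaccard coefficient > 0.85" graph, it keeps exactly one set per "chunk", and the `depth` mapping (distance from the top level of the hierarchy) and the final name-based tie-break are assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def chunks(gene_sets, threshold=0.85):
    """Group set names into connected components of the 'highly similar' graph."""
    parent = {name: name for name in gene_sets}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(gene_sets, 2):
        if jaccard(gene_sets[a], gene_sets[b]) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in gene_sets:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


def prune(gene_sets, depth, threshold=0.85):
    """Keep one set per chunk: the largest, then the most general (smallest depth)."""
    kept = []
    for chunk in chunks(gene_sets, threshold):
        best = min(chunk, key=lambda n: (-len(gene_sets[n]), depth[n], n))
        kept.append(best)
    return sorted(kept)
```

Here `gene_sets` maps a set name to a Python set of gene identifiers. Sets that are not highly similar to any other set form a chunk of one and are always kept.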

C2:CP:BioCarta - Content revision

Pathways curated from BioCarta have been revised to reflect the final versions available of the Human BioCarta pathways as represented on the NCI CGAP website. This resulted in an overall increase of +72 gene sets. As a result of this change, some gene set names were revised and several gene sets were removed, including:

BIOCARTA_CHREBP2_PATHWAY was renamed to BIOCARTA_CHREBP_PATHWAY.
BIOCARTA_FEEDER_PATHWAY was removed.
BIOCARTA_KREB_PATHWAY was removed.
BIOCARTA_NEUROTRANSMITTERS_PATHWAY was removed.
BIOCARTA_PROTEASOME_PATHWAY was removed.

Additionally, the omission of genes from BIOCARTA_STATHMIN_PATHWAY has been corrected.

C2:CP:PID - New sub-collection heading

Gene sets from the Pathway Interaction Database have been given a top-level sub-collection heading (PID) within C2:CP.

C2:CGP - Miscellaneous corrections to curated gene sets

The names of the gene sets ERB2_UP.V1_UP/DN have been corrected to: ERBB2_UP.V1_UP/DN to accurately reflect the gene symbol.
The gene set LEI_MYB_TARGETS was annotated as originating from the HG-U133A microarray platform. The correct platform is: HG_U95Av2. This has been corrected.
The gene sets OISHI_CHOLANGIOMA_STEM_CELL_LIKE_UP/DN were annotated as originating from the HuGene-1_0_st microarray platform. The correct platform is: Affymetrix HG U133 Plus 2.0. This has been corrected.
16 of the 21 gene sets derived from PubMed ID: 18509334, Authors: Mikkelsen TS, et al. were incorrectly annotated as being derived from human data. The originating species was, in fact, Mus musculus. This has been corrected.
The gene sets CHEMELLO_SOLEUS_VS_EDL_MYOFIBERS_UP/DN had been assigned an incorrect PubMed ID. The correct PMID: 21364935 has been assigned.
The original data source annotation for the gene sets HAN_SATB1_TARGETS_UP/DN had been inadvertently switched. HAN_SATB1_TARGETS_UP now correctly refers to Supplementary Table 3-b, and HAN_SATB1_TARGETS_DN now correctly refers to Supplementary Table 3-c, of the original source publication.
Four gene sets were incorrectly attributed to PubMed ID: 17906691, Author: Mantovani G., et al.: MANTOVANI_NFKB_TARGETS_UP, MANTOVANI_NFKB_TARGETS_DN, MANTOVANI_VIRAL_GPCR_SIGNALING_DN, and MANTOVANI_VIRAL_GPCR_SIGNALING_UP. These gene sets have been reassigned to the correct PMID and author (PMID: 17934524, Author: Martin D., et al.) and renamed to reflect this correction. See: MARTIN_NFKB_TARGETS_UP, MARTIN_NFKB_TARGETS_DN, MARTIN_VIRAL_GPCR_SIGNALING_DN, MARTIN_VIRAL_GPCR_SIGNALING_UP.
Due to an error, the members of 17 of the GARGALOVIC sets (PubMed ID: 16912112) were incorrect; this has been corrected. Likewise, 7 gene sets that were missed earlier have been added (+7 gene sets).

C2:CGP - Miscellaneous deprecated sets removed

Gene sets derived from the Signal Transduction Knowledge Environment have been removed from MSigDB (-27 gene sets). The underlying data for this resource is no longer available in such a way that the collection could be reliably maintained.
Gene sets with gene annotations derived from UniGene cluster identifiers have been retired and are no longer present in MSigDB 7.0 (-139 gene sets). The UniGene database has been retired by NCBI as of July 2019. This change affects only gene sets where UniGene cluster identifiers were present in the gene set's original ids annotation. The full list of affected gene sets is given in Appendix 1.
Some of the above deprecated gene sets were founder sets for one or more gene sets in the MSigDB Hallmark collection. These deprecated C2 sets are included in MSigDB 7.0 in an ARCHIVED collection, in order to preserve links to their pages from the hallmark gene set pages.

C5 (Gene Ontology collection) - Major overhaul

Gene sets in this collection are derived from the controlled vocabulary of the Gene Ontology (GO) project: The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology (Nature Genet 2000). The gene sets are named by GO term and contain genes annotated by that term. We have replaced the entire collection with new gene sets using recent GO term annotations (based on downloads from GO on February 21, 2019).

This collection is divided into three sub-collections:

CC: GO Cellular component (+421 gene sets). Gene sets derived from the Cellular Component Ontology.
MF: GO Molecular function (+744 gene sets). Gene sets derived from the Molecular Function Ontology.
BP: GO Biological process (+2914 gene sets). Gene sets derived from the Biological Process Ontology.

Outline of the procedure:

All sets are based on associations of GO terms to human genes. Genes annotated with the same GO term make the corresponding GO term gene set.

The input files are:

gene2go (downloaded on February 21, 2019)

This file reports GO terms that have been associated with genes in NCBI Entrez Gene. It is generated by processing the gene_association file on the GO FTP site and comparing the DB_Object_ID to annotation in NCBI Entrez Gene, as also reported in gene_info.gz. The file is available here. It is a tab delimited plain text file with one tax_id / gene_id / evidence_code per line.

go-basic.obo (downloaded on February 21, 2019)

This file contains the entire GO ontology in OBO v.1.2 format.

This procedure has been modified from that described previously for MSigDB v5.2. First, for each GO term we retrieved the corresponding human genes from the gene2go file. Next, we applied the path rule: gene products are associated with the most specific GO terms possible, and all parent terms up to the root automatically apply to the gene product. Thus, the parent GO term gene sets include all genes associated with their child GO terms. Then we removed sets with fewer than 5 or more than 2,000 Gene IDs. Finally, we resolved redundancies as follows. We computed Jaccard coefficients for each pair of sets, and marked a pair as highly similar if its Jaccard coefficient was greater than 0.85. We then clustered highly similar sets into "chunks" using the hclust function from the R stats package according to their GO terms and applied two rounds of filtering for every "chunk". First, we kept the largest set in the "chunk" and discarded the smaller sets. This left "chunks" of highly similar sets of identical sizes, which we further pruned by preferentially keeping the more general set (i.e., the set closest to the root of the GO ontology tree).
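
The path rule and size filter in the procedure above can be illustrated with a short sketch. It is not the code used to build C5: the two input mappings are hypothetical stand-ins for data parsed from gene2go and go-basic.obo, and which ontology relationships count as "parent" links is an assumption left to whoever builds the `parents` mapping.

```python
def apply_path_rule(direct_genes, parents):
    """Propagate each term's genes to all of its ancestors.

    direct_genes: {term: set of gene IDs annotated directly to the term}
    parents:      {term: iterable of parent terms}
    Returns {term: genes annotated to the term or to any of its descendants}.
    """
    propagated = {}
    for term, genes in direct_genes.items():
        stack, seen = [term], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # a term can be reached through several paths
            seen.add(current)
            propagated.setdefault(current, set()).update(genes)
            stack.extend(parents.get(current, ()))
    return propagated


def size_filter(gene_sets, min_size=5, max_size=2000):
    """Drop sets with fewer than `min_size` or more than `max_size` gene IDs."""
    return {
        term: genes
        for term, genes in gene_sets.items()
        if min_size <= len(genes) <= max_size
    }
```

Because genes are pushed upward before filtering, a parent term can pass the minimum size threshold even when each of its child terms is too small to be kept on its own.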

A previous version of the C5 collection contained 864 gene sets that were founder sets for one or more gene sets in the MSigDB Hallmark collection. These deprecated C5 sets are included in MSigDB 7.0 as an ARCHIVED collection in order to preserve links to their pages from the hallmark gene set pages.

C6 (Oncogenic signatures) - Miscellaneous corrections

An error was identified with the gene sets PIGF_UP.V1.UP and PIGF_UP.V1.DN. The original publication utilized an alias for the Placental Growth Factor gene which included a lowercase L (PlGF). A curation error when the sets were added to MSigDB converted the lowercase L to a capital I. This error in turn resulted in the incorrect annotation of the sets with the NCBI Gene ID for Phosphatidylinositol glycan anchor biosynthesis class F (Gene Symbol: PIGF, NCBI Gene ID: 5281). The gene sets have been corrected to PGF_UP.V1.UP and PGF_UP.V1.DN respectively, and correctly linked to NCBI Gene ID: 5228.
Errors in the metadata for the gene set NFE2L2.V2 were corrected.
This gene set had been incorrectly annotated as a signature of genes up-regulated in response to knockout of the nuclear factor NRF2. This gene set properly represents the signature of genes down-regulated upon NFE2L2 knockout and has been corrected to reflect this.
Additionally, this gene set had been misattributed to Malhotra et al., PubMed ID: 20460467. The correct publication, Kim et al., PubMed ID: 27088724, has been assigned.

Appendix 1: UniGene Derived Gene Sets Removed from C2:CGP

ACEVEDO_FGFR1_TARGETS_IN_PROSTATE_CANCER_MODEL_DN
ACEVEDO_FGFR1_TARGETS_IN_PROSTATE_CANCER_MODEL_UP
ALONSO_METASTASIS_DN
ALONSO_METASTASIS_UP
AMUNDSON_GAMMA_RADIATION_RESISTANCE
AMUNDSON_GAMMA_RADIATION_RESPONSE
AMUNDSON_GENOTOXIC_SIGNATURE
AMUNDSON_POOR_SURVIVAL_AFTER_GAMMA_RADIATION_2G
AMUNDSON_POOR_SURVIVAL_AFTER_GAMMA_RADIATION_8G
BACOLOD_RESISTANCE_TO_ALKYLATING_AGENTS_DN
BACOLOD_RESISTANCE_TO_ALKYLATING_AGENTS_UP
BUDHU_LIVER_CANCER_METASTASIS_DN
BUDHU_LIVER_CANCER_METASTASIS_UP
CHANG_CORE_SERUM_RESPONSE_DN
CHANG_CORE_SERUM_RESPONSE_UP
CHANG_CYCLING_GENES
CHAN_INTERFERON_PRODUCING_DENDRITIC_CELL
CSR_EARLY_UP
CSR_LATE_UP
DAIRKEE_TERT_TARGETS_DN
DAIRKEE_TERT_TARGETS_UP
DARWICHE_PAPILLOMA_RISK_HIGH_VS_LOW_DN
DARWICHE_PAPILLOMA_RISK_HIGH_VS_LOW_UP
FALVELLA_SMOKERS_WITH_LUNG_CANCER
GEORGANTAS_HSC_MARKERS
GRADE_COLON_CANCER_DN
GRADE_COLON_CANCER_UP
GRADE_METASTASIS_DN
HAMAI_APOPTOSIS_VIA_TRAIL_DN
HEDENFALK_BREAST_CANCER_BRCA1_VS_BRCA2
HEDENFALK_BREAST_CANCER_HEREDITARY_VS_SPORADIC
JIANG_HYPOXIA_CANCER
JIANG_HYPOXIA_NORMAL
JIANG_HYPOXIA_VIA_VHL
JIANG_VHL_TARGETS
JUBAN_TARGETS_OF_SPI1_AND_FLI1_DN
JUBAN_TARGETS_OF_SPI1_AND_FLI1_UP
KANG_GIST_WITH_PDGFRA_DN
KANG_GIST_WITH_PDGFRA_UP
KEEN_RESPONSE_TO_ROSIGLITAZONE_DN
KEEN_RESPONSE_TO_ROSIGLITAZONE_UP
KUROKAWA_LIVER_CANCER_CHEMOTHERAPY_DN
KUROKAWA_LIVER_CANCER_CHEMOTHERAPY_UP
LABBE_TARGETS_OF_TGFB1_AND_WNT3A_DN
LABBE_TARGETS_OF_TGFB1_AND_WNT3A_UP
LABBE_TGFB1_TARGETS_DN
LABBE_TGFB1_TARGETS_UP
LABBE_WNT3A_TARGETS_DN
LABBE_WNT3A_TARGETS_UP
LEE_LIVER_CANCER
MAHADEVAN_RESPONSE_TO_MP470_DN
MAHADEVAN_RESPONSE_TO_MP470_UP
MA_PITUITARY_FETAL_VS_ADULT_DN
MA_PITUITARY_FETAL_VS_ADULT_UP
MEINHOLD_OVARIAN_CANCER_LOW_GRADE_DN
MEINHOLD_OVARIAN_CANCER_LOW_GRADE_UP
MENSSEN_MYC_TARGETS
NAKAYAMA_FGF2_TARGETS
NELSON_RESPONSE_TO_ANDROGEN_DN
NELSON_RESPONSE_TO_ANDROGEN_UP
NIELSEN_GIST
NIELSEN_GIST_AND_SYNOVIAL_SARCOMA_DN
NIELSEN_GIST_AND_SYNOVIAL_SARCOMA_UP
NIELSEN_GIST_VS_SYNOVIAL_SARCOMA_DN
NIELSEN_GIST_VS_SYNOVIAL_SARCOMA_UP
NIELSEN_LEIOMYOSARCOMA_CNN1_DN
NIELSEN_LEIOMYOSARCOMA_CNN1_UP
NIELSEN_LEIOMYOSARCOMA_DN
NIELSEN_LEIOMYOSARCOMA_UP
NIELSEN_LIPOSARCOMA_DN
NIELSEN_LIPOSARCOMA_UP
NIELSEN_MALIGNAT_FIBROUS_HISTIOCYTOMA_DN
NIELSEN_MALIGNAT_FIBROUS_HISTIOCYTOMA_UP
NIELSEN_SCHWANNOMA_DN
NIELSEN_SCHWANNOMA_UP
NIELSEN_SYNOVIAL_SARCOMA_DN
NIELSEN_SYNOVIAL_SARCOMA_UP
NING_CHRONIC_OBSTRUCTIVE_PULMONARY_DISEASE_DN
NING_CHRONIC_OBSTRUCTIVE_PULMONARY_DISEASE_UP
OKAMOTO_LIVER_CANCER_MULTICENTRIC_OCCURRENCE_DN
OKAMOTO_LIVER_CANCER_MULTICENTRIC_OCCURRENCE_UP
OXFORD_RALA_AND_RALB_TARGETS_DN
OXFORD_RALA_AND_RALB_TARGETS_UP
OXFORD_RALA_OR_RALB_TARGETS_DN
OXFORD_RALA_OR_RALB_TARGETS_UP
OXFORD_RALA_TARGETS_DN
OXFORD_RALA_TARGETS_UP
OXFORD_RALB_TARGETS_DN
OXFORD_RALB_TARGETS_UP
PASTURAL_RIZ1_TARGETS_UP
PATTERSON_DOCETAXEL_RESISTANCE
PENG_GLUCOSE_DEPRIVATION_DN
PENG_GLUCOSE_DEPRIVATION_UP
PENG_GLUTAMINE_DEPRIVATION_DN
PENG_GLUTAMINE_DEPRIVATION_UP
PENG_LEUCINE_DEPRIVATION_DN
PENG_LEUCINE_DEPRIVATION_UP
PENG_RAPAMYCIN_RESPONSE_DN
PENG_RAPAMYCIN_RESPONSE_UP
ROME_INSULIN_TARGETS_IN_MUSCLE_DN
ROME_INSULIN_TARGETS_IN_MUSCLE_UP
SCHRAMM_INHBA_TARGETS_DN
SCHRAMM_INHBA_TARGETS_UP
SCHURINGA_STAT5A_TARGETS_DN
SCHURINGA_STAT5A_TARGETS_UP
SEIKE_LUNG_CANCER_POOR_SURVIVAL
SHEPARD_BMYB_MORPHOLINO_DN
SHEPARD_BMYB_MORPHOLINO_UP
SHEPARD_CRUSH_AND_BURN_MUTANT_UP
TRACEY_RESISTANCE_TO_IFNA2_DN
TRACEY_RESISTANCE_TO_IFNA2_UP
WANG_RESPONSE_TO_PACLITAXEL_VIA_MAPK8_DN
WANG_RESPONSE_TO_PACLITAXEL_VIA_MAPK8_UP
WATTEL_AUTONOMOUS_THYROID_ADENOMA_DN
WATTEL_AUTONOMOUS_THYROID_ADENOMA_UP
WONG_IFNA2_RESISTANCE_DN
WONG_IFNA2_RESISTANCE_UP
YANG_MUC2_TARGETS_COLON_3MO_DN
YANG_MUC2_TARGETS_DUODENUM_3MO_DN
YANG_MUC2_TARGETS_DUODENUM_6MO_DN
YANG_MUC2_TARGETS_DUODENUM_6MO_UP
YE_METASTATIC_LIVER_CANCER
YOSHIOKA_LIVER_CANCER_EARLY_RECURRENCE_DN
YOSHIOKA_LIVER_CANCER_EARLY_RECURRENCE_UP
ZEMBUTSU_SENSITIVITY_TO_CISPLATIN
ZEMBUTSU_SENSITIVITY_TO_CYCLOPHOSPHAMIDE
ZEMBUTSU_SENSITIVITY_TO_DOXORUBICIN
ZEMBUTSU_SENSITIVITY_TO_FLUOROURACIL
ZEMBUTSU_SENSITIVITY_TO_METHOTREXATE
ZEMBUTSU_SENSITIVITY_TO_MITOMYCIN
ZEMBUTSU_SENSITIVITY_TO_NIMUSTINE
ZEMBUTSU_SENSITIVITY_TO_VINBLASTINE
ZEMBUTSU_SENSITIVITY_TO_VINCRISTINE
ZHANG_TLX_TARGETS_36HR_DN
ZHANG_TLX_TARGETS_36HR_UP
ZHANG_TLX_TARGETS_60HR_DN
ZHANG_TLX_TARGETS_60HR_UP
ZHANG_TLX_TARGETS_DN
ZHANG_TLX_TARGETS_UP
ZUCCHI_METASTASIS_DN
ZUCCHI_METASTASIS_UP

@@ Line 7: / Line 7: @@
 </span>
-This page describes the changes made to the gene set collections for Release 7.0 of the Molecular Signatures Database (MSigDB). This is a major release that includes substantial updates to gene set annotations, gene symbol mapping procedures, overhaul of several collections/sub-collections, and corrections to miscellaneous errors .
+This page describes the changes made to the gene set collections for Release 7.0 of the Molecular Signatures Database (MSigDB). This is a major release that includes substantial updates to gene set annotations, gene symbol mapping procedures, overhaul of several collections/sub-collections, and corrections to miscellaneous errors.
+<b>Note:</b> Due to substantial changes in MSigDB, it is recommended that users migrate to GSEA 4.0.0+ when utilizing MSigDB 7.0+ resources.<br>
+<b>Advisory</b>: It is strongly recommended that users of MSigDB7/GSEA4.0 '''always''' use the GSEA "Collapse dataset to gene symbols" feature with the provided Symbol Remapping chip file if your dataset was generated with a transcriptome other than Ensembl 97/GENCODE 31.
-<b>Note:</b> Due to substantial changes in MSigDB, it is recommended that users migrate to GSEA 4.0.0+ when utilizing MSigDB 7.0+ resources.
 <h2>Changes to MSigDB Gene Symbol Mapping Procedures</h2>
-<h3>Now using ENSEMBL as the platform annotation authority</h3>
+<h3>Now using Ensembl as the platform annotation authority</h3>
-Beginning in MSigDB 7.0, identifiers for genes are mapped to their HGNC approved Gene Symbol and NCBI Gene ID through annotations extracted from ENSEMBL's BioMart data service, and will be updated at each MSigDB release with the latest available version of ENSEMBL. This change mitigates a previous issue where retired gene symbols and symbol aliases that did not reflect the current annotation of the human genome were retained in MSigDB as a result of outdated microarray and transcriptome annotations. This issue resulted in symbols being excluded from some gene sets and GSEA analyses due to the potential presence of multiple symbols for the same gene in different gene sets as a result of differing source annotations for those gene sets, and mismatches between the symbols present in the user supplied dataset and those included in MSigDB.
+Beginning in MSigDB 7.0, identifiers for genes are mapped to their HGNC approved Gene Symbol and NCBI Gene ID through annotations extracted from Ensembl's BioMart data service, and will be updated at each MSigDB release with the latest available version of Ensembl. This change mitigates a previous issue where retired gene symbols and symbol aliases that did not reflect the current annotation of the human genome were retained in MSigDB as a result of outdated microarray and transcriptome annotations. This issue resulted in symbols being excluded from some gene sets and GSEA analyses due to the potential presence of multiple symbols for the same gene in different gene sets as a result of differing source annotations for those gene sets, and mismatches between the symbols present in the user supplied dataset and those included in MSigDB.
 <ul>
-     <li>Gene annotations supplied in the MSigDB 7.0 release are derived from '''<span class="plainlinks">[http://jul2019.archive.ensembl.org/index.html ENSEMBL version 97]</span>''' corresponding to '''Gencode release 31''' and reflect the HGNC Gene Symbols as of the Gencode 31 freeze date of February 2019.</li>
+     <li>Gene annotations supplied in the MSigDB 7.0 release are derived from '''<span class="plainlinks">[http://jul2019.archive.ensembl.org/index.html Ensembl version 97]</span>''' corresponding to '''GENCODE release 31''' and reflect the HGNC Gene Symbols as of the GENCODE 31 freeze date of February 2019.</li>
 </ul>
 <h3>Change to gene orthology mapping procedure for non-human gene sets</h3>
 <ul>
-     <li>Mouse and Rat genes were assigned to their corresponding Human orthologues using the gene orthologies provided in ENSEMBL BioMart for ENSEMBL version 97.</li>
+     <li>Mouse and Rat genes were assigned to their corresponding Human orthologues using the gene orthologies provided in Ensembl BioMart for Ensembl version 97.</li>
      <li>As many Mouse and Rat genes correspond to many possible Human orthologues of various fidelity, a ranking procedure was utilized to match each respective non-human gene to its best orthologue match. Genes were ranked by their dS/dN score, their averaged reciprocal percent identicality, their Human Gene-order conservation score, and their Human Whole-genome alignment coverage. These metrics identify likely best orthologues using a combination of gene coding sequence conservation, gene non-coding sequence conservation, and genomic architecture conservation.</li>
 </ul>
@@ Line 25: / Line 27: @@
 <h3>CHIP file updates</h3>
 <ul>
-     <li>A new Gene Symbol CHIP file for the GSEA "Collapse dataset" feature will be supplied in order to facilitate remapping data sets which use gene annotations prior to the ENSEMBL release 97/Gencode release 31 namespace used in MSigDB 7.0 in to this space for GSEA.</li>
+     <li>A new Gene Symbol CHIP file for the GSEA "Collapse dataset" feature will be supplied in order to facilitate remapping data sets which use gene annotations prior to the Ensembl release 97/GENCODE release 31 namespace used in MSigDB 7.0 in to this space for GSEA.</li>
-     <li>New CHIP files have been provided to enable the use of data sets containing Mouse/Rat gene symbols directly through the use of the GSEA "Collapse dataset" feature. These annotations are derived from ENSEMBL 97's Mouse and Rat databases respectively, and support experiments from pipelines relying on Gencode annotations up to Gencode release M22 (Mouse). Mappings to orthologous Human genes were derived by the procedure described above.</li>
+     <li>New CHIP files have been provided to enable the use of data sets containing Mouse/Rat gene symbols directly through the use of the GSEA "Collapse dataset" feature. These annotations are derived from Ensembl 97's Mouse and Rat databases respectively, and support experiments from pipelines relying on GENCODE annotations up to GENCODE release M22 (Mouse). Mappings to orthologous Human genes were derived by the procedure described above.</li>
      <li>Previous symbols and aliases for each current gene were provided by their respective symbol authorities (e.g. HGNC for Human, MGI for Mouse, and RGD for Rat).</li>
      <li>Previous NCBI IDs for all genes were extracted from the NCBI gene_history file available from the NCBI FTP server.</li>
-     <li>Several CHIP files annotating platforms which are not included in ENSEMBL's BioMart database have been depreciated.</li>
+     <li>Several CHIP files annotating platforms which are not included in Ensembl's BioMart database have been depreciated.</li>
-     <li>Annotations for all platforms represented in ENSEMBL's BioMart database have been updated to reflect the ENSEMBL version 97 annotations.</li>
+     <li>Annotations for all platforms represented in Ensembl's BioMart database have been updated to reflect the Ensembl version 97 annotations.</li>
+</ul>
+<h3>Changes to data set handling recommendations</h3>
+<ul>
+    <li>Due to substantial changes in MSigDB, it is recommended that users '''migrate to GSEA 4.0.0+ when utilizing MSigDB 7.0+''' resources.</li>
+    <li>It is strongly recommended that users of MSigDB7/GSEA4.0 '''always''' use the GSEA "Collapse dataset to gene symbols" feature with following the provided Symbol Remapping chip files '''even if your experiment is already in the gene symbols namespace''' as this will ensure that your gene symbols are matched to those used in MSigDB7.
+<ul>
+   <li>Human_Symbol_with_Remapping_MSigDB.v7.0.chip</li>
+   <li>Mouse_Gene_Symbol_Remapping_MSigDB.v7.0.chip</li>
+   <li>Rat_Gene_Symbol_Remapping_MSigDB.v7.0.chip</li>
+</ul>
+This remapping is not necessary if your data set was generated using Ensembl 97 or GENCODE 31 transcriptomes. This is a change from our previous recommendation.</li>
 </ul>
@@ Line 39: / Line 52: @@
 <h3>C1 (positional gene sets) - Major overhaul </h3>
-C1 has been rebuilt to reflect the primary assembly of the current release of the Human Genome as present in ENSEMBL 97 and Gencode 31 (GRCh38.p12). Gene annotations for this collection are derived from the <tt>Chromosome</tt> and <tt>Karyotype band</tt> tracks from the ENSEMBL BioMart (version 97) and reflect the gene architecture as represented on the primary assembly. This resulted in a small reduction in the number of gene sets (-27), as sets representing complete chromosome arms with few annotated genes were removed.
+C1 has been rebuilt to reflect the primary assembly of the current release of the Human Genome as present in Ensembl 97 and GENCODE 31 (GRCh38.p12). Gene annotations for this collection are derived from the ''Chromosome'' and ''Karyotype band'' tracks from the Ensembl BioMart (version 97) and reflect the gene architecture as represented on the primary assembly. This resulted in a small reduction in the number of gene sets (-27), as sets representing complete chromosome arms with few annotated genes were removed.
 <h3>C2:CP:Reactome - Major overhaul </h3>
 <ul>
      <li>Reactome gene sets have been updated to reflect the state of the Reactome pathway architecture as of '''Reactome v69''' (+825 gene sets).
-<li>In order to limit redundancy between gene sets within the Reactome sub-collection we applied a filtering procedure based on <tt>Jaccard's coefficients</tt> and distance from the top level of the Reactome event hierarchy. This is similar to the procedure applied in the C5 (Gene Ontology) collection (see below). Briefly, we computed <tt>Jaccard's coefficients</tt> for each pair of sets, and marked a pair as highly similar if its <tt>Jaccard's coefficient</tt> was greater than 0.85. We then clustered highly similar sets into "chunks" using the <tt>hclust</tt> function from the R <tt>stats</tt> package and applied two rounds of filtering for every "chunk". First, we kept the largest set in the "chunk" and discarded the smaller sets. This left "chunks" of highly similar sets of identical sizes, which we further pruned by preferentially keeping the more general set (i.e., the set closest to one of the 28 pathways at the top level of the Reactome Event Hierarchy).</p>
+<li>In order to limit redundancy between gene sets within the Reactome sub-collection we applied a filtering procedure based on Jaccard coefficients and distance from the top level of the Reactome event hierarchy. This is similar to the procedure applied in the C5 (Gene Ontology) collection (see below). Briefly, we computed Jaccard coefficients for each pair of sets, and marked a pair as highly similar if its Jaccard coefficient was greater than 0.85. We then clustered highly similar sets into "chunks" using the <tt>hclust</tt> function from the R <tt>stats</tt> package and applied two rounds of filtering for every "chunk". First, we kept the largest set in the "chunk" and discarded the smaller sets. This left "chunks" of highly similar sets of identical sizes, which we further pruned by preferentially keeping the more general set (i.e., the set closest to one of the 28 pathways at the top level of the Reactome Event Hierarchy).
 </ul>
@@ Line 50: / Line 63: @@
 Pathways curated from BioCarta have been revised to reflect the final versions available of the Human BioCarta pathways as represented on the <span class="plainlinks">[https://cgap.nci.nih.gov/Pathways/BioCarta_Pathways NCI CGAP website].</span> This resulted in an overall increase of +72 gene sets. Gene set names were also revised as a result of this change and several gene sets were removed including:
 <ul>
-     <li><tt>BIOCARTA_CHREBP2_PATHWAY</tt> was renamed to <tt>BIOCARTA_CHREBP_PATHWAY</tt>.</li>
+     <li>''BIOCARTA_CHREBP2_PATHWAY'' was renamed to ''BIOCARTA_CHREBP_PATHWAY''.</li>
-     <li><tt>BIOCARTA_FEEDER_PATHWAY</tt> was removed.</li>
+     <li>''BIOCARTA_FEEDER_PATHWAY'' was removed.</li>
-     <li><tt>BIOCARTA_KREB_PATHWAY</tt> was removed.</li>
+     <li>''BIOCARTA_KREB_PATHWAY'' was removed.</li>
-     <li><tt>BIOCARTA_NEUROTRANSMITTERS_PATHWAY</tt> was removed.</li>
+     <li>''BIOCARTA_NEUROTRANSMITTERS_PATHWAY'' was removed.</li>
-     <li><tt>BIOCARTA_PROTEASOME_PATHWAY</tt> was removed.</li>
+     <li>''BIOCARTA_PROTEASOME_PATHWAY'' was removed.</li>
 </ul>
-Additionally, missing genes from the <tt>BIOCARTA_STATHMIN_PATHWAY</tt> have been corrected.
+Additionally, missing genes from the ''BIOCARTA_STATHMIN_PATHWAY'' have been corrected.
 <h3>C2:CP:PID - New sub-collection heading</h3>
@@ Line 63: / Line 76: @@
 <h3>C2:CGP - Miscellaneous corrections to curated gene sets</h3>
 <ul>
-     <li>The names of the gene sets <tt>ERB2_UP.V1_UP/DN</tt> have been corrected to: <tt>ERBB2_UP.V1_UP/DN</tt> to accurately reflect the gene symbol.</li>
+     <li>The names of the gene sets ''ERB2_UP.V1_UP/DN'' have been corrected to: ''ERBB2_UP.V1_UP/DN'' to accurately reflect the gene symbol.</li>
-     <li>The gene set <tt>LEI_MYB_TARGETS</tt> was annotated as originating from the <tt>HG-U133A</tt> microarray platform. The correct platform is: <tt>HG_U95Av2</tt>. This has been corrected.</li>
+     <li>The gene set ''LEI_MYB_TARGETS'' was annotated as originating from the HG-U133A microarray platform. The correct platform is: HG_U95Av2. This has been corrected.</li>
-     <li>The gene sets <tt>OISHI_CHOLANGIOMA_STEM_CELL_LIKE_UP/DN</tt> were annotated as originating from the <tt>HuGene-1_0_st</tt> microarray platform. The correct platform is: <tt>Affymetrix HG U133 Plus 2.0</tt>. This has been corrected.</li>
+     <li>The gene sets ''OISHI_CHOLANGIOMA_STEM_CELL_LIKE_UP/DN'' were annotated as originating from the HuGene-1_0_st microarray platform. The correct platform is: Affymetrix HG U133 Plus 2.0. This has been corrected.</li>
      <li>16 of the 21 gene sets derived from PubMed ID: 18509334, Authors: Mikkelsen TS, et al. were incorrectly annotated as being derived from human data. The originating species was, in fact, Mus musculus. This has been corrected.</li>
-     <li>The gene sets <tt>CHEMELLO_SOLEUS_VS_EDL_MYOFIBERS_UP/DN</tt> had been assigned an incorrect PubMed ID. The correct PMID: 21364935 has been assigned.</li>
+     <li>The gene sets ''CHEMELLO_SOLEUS_VS_EDL_MYOFIBERS_UP/DN'' had been assigned an incorrect PubMed ID. The correct PMID: 21364935 has been assigned.</li>
-     <li>The original data source annotation for the gene sets <tt>HAN_SATB1_TARGETS_UP/DN</tt> had been inadvertently switched. <tt>HAN_SATB1_TARGETS_UP</tt> now correctly refers to Supplementary Table 3-b, and <tt>HAN_SATB1_TARGETS_DN</tt> now correctly refers to Supplementary Table 3-c, of the original source publication.</li>
+     <li>The original data source annotation for the gene sets ''HAN_SATB1_TARGETS_UP/DN'' had been inadvertently switched. ''HAN_SATB1_TARGETS_UP'' now correctly refers to Supplementary Table 3-b, and ''HAN_SATB1_TARGETS_DN'' now correctly refers to Supplementary Table 3-c, of the original source publication.</li>
-     <li>Four gene sets were incorrectly attributed to PubMed ID: 17906691, Author: Mantovani G., et al.: <tt>MANTOVANI_NFKB_TARGETS_UP, MANTOVANI_NFKB_TARGETS_DN, MANTOVANI_VIRAL_GPCR_SIGNALING_DN, MANTOVANI_VIRAL_GPCR_SIGNALING_UP</tt> these gene sets have been renamed and reassigned to reflect the correct PMID and author. PMID: 17934524, Author: Martin D., et al. The gene set names have been edited to reflect this correction. See: <tt>MARTIN_NFKB_TARGETS_UP, MARTIN_NFKB_TARGETS_DN, MARTIN_VIRAL_GPCR_SIGNALING_DN, MARTIN_VIRAL_GPCR_SIGNALING_UP</tt> </li>
+     <li>Four gene sets were incorrectly attributed to PubMed ID: 17906691, Author: Mantovani G., et al.: ''MANTOVANI_NFKB_TARGETS_UP, MANTOVANI_NFKB_TARGETS_DN, MANTOVANI_VIRAL_GPCR_SIGNALING_DN, MANTOVANI_VIRAL_GPCR_SIGNALING_UP''. These gene sets have been reassigned to the correct publication (PMID: 17934524, Author: Martin D., et al.) and renamed accordingly: ''MARTIN_NFKB_TARGETS_UP, MARTIN_NFKB_TARGETS_DN, MARTIN_VIRAL_GPCR_SIGNALING_DN, MARTIN_VIRAL_GPCR_SIGNALING_UP''.</li>
+    <li>Due to an error, the members of 17 of the ''GARGALOVIC'' sets (PubMed ID: 16912112) were incorrect; this has been corrected. In addition, 7 gene sets that were missed earlier have been added.</li>
 </ul>
@@ Line 92: / Line 106: @@
 <ul>
 <li><b>gene2go</b> (downloaded on February 21, 2019)</li>
-<p>This file reports GO terms that have been associated with genes in NCBI Entrez Gene. It is generated by processing the gene_association file on the [http://www.geneontology.org/GO.current.annotations.shtml GO FTP site] and comparing the DB_Object_ID to annotation in NCBI Entrez Gene, as also reported in <tt>gene_info.gz</tt>. The file is available <span class="plainlinks">[ftp://ftp.ncbi.nih.gov/gene/DATA/ here]</span>. It is a tab delimited plain text file with one <tt>tax_id / gene_id / evidence_code</tt> per line.</p>
+<p>This file reports GO terms that have been associated with genes in NCBI Entrez Gene. It is generated by processing the gene_association file on the [http://www.geneontology.org/GO.current.annotations.shtml GO FTP site] and comparing the DB_Object_ID to annotation in NCBI Entrez Gene, as also reported in ''gene_info.gz''. The file is available <span class="plainlinks">[ftp://ftp.ncbi.nih.gov/gene/DATA/ here]</span>. It is a tab delimited plain text file with one tax_id / gene_id / evidence_code per line.</p>
 <li><b>go-basic.obo</b> (downloaded on February 21, 2019)</li>
 <p>This file contains the entire GO ontology in <span class="plainlinks">[http://owlcollab.github.io/oboformat/doc/GO.format.obo-1_2.html OBO v.1.2 format].</span></p>
 </ul>
-<p>This procedure has been modified from that described previously for MSigDB v5.2. First, for each GO term we got the corresponding human genes from the gene2go file. Next, we have applied the path rule. Gene products are associated with the most specific GO terms possible. All parent terms up to the root automatically apply to the gene product. Thus, the parent GO term gene sets should include all genes associated with the children GO terms. Then we removed sets with fewer than <b>5</b> or more than 2,000 Gene IDs. Finally, we resolved redundancies as follows. We computed <tt>Jaccard's coefficients</tt> for each pair of sets, and marked a pair as highly similar if its <tt>Jaccard's coefficient</tt> was greater than 0.85. We then clustered highly similar sets into "chunks" using the <tt>hclust</tt> function from the R <tt>stats</tt> package according to their GO terms and applied two rounds of filtering for every "chunk". First, we kept the largest set in the "chunk" and discarded the smaller sets. This left "chunks" of highly similar sets of identical sizes, which we further pruned by preferentially keeping the more general set (i.e., the set closest to the root of the GO ontology tree).</p>
+<p>This procedure has been modified from that described previously for MSigDB v5.2. First, for each GO term we obtained the corresponding human genes from the gene2go file. Next, we applied the path rule: gene products are associated with the most specific GO terms possible, and all parent terms up to the root automatically apply to the gene product. Thus, each parent GO term gene set includes all genes associated with its child GO terms. We then removed sets with fewer than <b>5</b> or more than 2,000 Gene IDs. Finally, we resolved redundancies as follows. We computed Jaccard coefficients for each pair of sets, and marked a pair as highly similar if its Jaccard coefficient was greater than 0.85. We then clustered highly similar sets into "chunks" using the <tt>hclust</tt> function from the R <tt>stats</tt> package according to their GO terms and applied two rounds of filtering for every "chunk". First, we kept the largest set in the "chunk" and discarded the smaller sets. This left "chunks" of highly similar sets of identical sizes, which we further pruned by preferentially keeping the more general set (i.e., the set closest to the root of the GO ontology tree).</p>
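The path rule and size filter described above can be illustrated with a small sketch. It assumes a toy ontology given as a plain child-to-parents map; the function names and the term names <tt>root</tt>, <tt>mid</tt>, <tt>leaf1</tt> and <tt>leaf2</tt> are hypothetical, and parsing of <tt>gene2go</tt> and <tt>go-basic.obo</tt> is not shown.

```python
def propagate_to_ancestors(direct_genes, parents):
    """Apply the path rule: every term also receives the genes of its descendants.

    direct_genes: dict mapping term -> set of directly annotated gene IDs
    parents:      dict mapping term -> list of parent terms (acyclic graph)
    """

    def ancestors(term, seen):
        # Collect all terms reachable by following parent links.
        for p in parents.get(term, ()):
            if p not in seen:
                seen.add(p)
                ancestors(p, seen)
        return seen

    full = {}
    for term, genes in direct_genes.items():
        # A gene annotated to a term also belongs to each ancestor of that term.
        for t in {term} | ancestors(term, set()):
            full.setdefault(t, set()).update(genes)
    return full


def size_filter(gene_sets, min_size=5, max_size=2000):
    """Drop sets with fewer than min_size or more than max_size gene IDs."""
    return {t: g for t, g in gene_sets.items() if min_size <= len(g) <= max_size}


toy_direct = {"leaf1": {1, 2, 3}, "leaf2": {3, 4, 5, 6}, "mid": {7}}
toy_parents = {"leaf1": ["mid"], "leaf2": ["mid"], "mid": ["root"]}

toy_full = propagate_to_ancestors(toy_direct, toy_parents)
print(sorted(size_filter(toy_full)))
```

Here <tt>mid</tt> and <tt>root</tt> each end up with all seven genes and pass the size filter, while the two leaf terms fall below the minimum of 5 and are removed.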
 <p>A previous version of the C5 collection contained 864 gene sets that were founder sets for one or more gene set in the MSigDB Hallmark collection. These deprecated C5 sets are included in MSigDB 7.0 as an ARCHIVED collection in order to preserve links to their pages from the hallmark gene set pages.</p>
 <h3>C6 (Oncogenic signatures) - Miscellaneous corrections </h3>
 <ul>
-<li>An error was identified with the gene sets <tt>PIGF_UP.V1.UP</tt> and <tt>PIGF_UP.V1.DN</tt>. The original publication utilized an alias for the Placental Growth Factor gene which included a lowercase L (PlGF). This underwent a curation error when the sets were added to MSigDB which resulted in the conversion of the lowercase L to a capitol I. This error in turn resulted in the incorrect annotation of the sets with the NCBI Gene ID for Phosphatidylinositol glycan anchor biosynthesis class F (Gene Symbol: PIDF, NCBI Gene ID: 5281).
+<li>An error was identified with the gene sets ''PIGF_UP.V1.UP'' and ''PIGF_UP.V1.DN''. The original publication used an alias for the Placental Growth Factor gene that includes a lowercase L (PlGF). A curation error when the sets were added to MSigDB converted the lowercase L to a capital I. This in turn resulted in the incorrect annotation of the sets with the NCBI Gene ID for Phosphatidylinositol glycan anchor biosynthesis class F (Gene Symbol: PIGF, NCBI Gene ID: 5281).
-The gene sets have been corrected to <tt>PGF_UP.V1.UP</tt> and <tt>PGF_UP.V1.DN</tt> respectively, and correctly linked to NCBI Gene ID: 5228. </li>
+The gene sets have been corrected to ''PGF_UP.V1.UP'' and ''PGF_UP.V1.DN'' respectively, and correctly linked to NCBI Gene ID: 5228. </li>
-<li>Errors in the metadata for the gene set <tt>NFE2L2.V2</tt> were corrected. <br> This gene set had been incorrectly annotated as a signature of genes up-regulated in response to knockout of the nuclear factor NRF2. This gene set properly represents the signature of genes ''down''-regulated upon NFE2L2.V2 knockout and has been corrected to reflect this. <br> Additionally, this gene set had been miss-attributed to Malhotra et al., PubMed ID 20460467, the correct publication of Kim et al., PubMed ID: 27088724 has been assigned.</li>
+<li>Errors in the metadata for the gene set ''NFE2L2.V2'' were corrected. <br> This gene set had been incorrectly annotated as a signature of genes up-regulated in response to knockout of the nuclear factor NRF2. This gene set properly represents the signature of genes ''down''-regulated upon NFE2L2 knockout and has been corrected to reflect this. <br> Additionally, this gene set had been misattributed to Malhotra et al., PubMed ID: 20460467; the correct publication, Kim et al., PubMed ID: 27088724, has been assigned.</li>
 </ul>


