MSigDB v7.0 Release Notes

From GeneSetEnrichmentAnalysisWiki
Revision as of 15:04, 2 August 2019 by Acastanza (talk | contribs)

This page describes the changes made to the gene set collections for Release 7.0 of the Molecular Signatures Database (MSigDB). This is a major release that includes substantial updates to gene set annotations and gene symbol mapping procedures, corrections to miscellaneous errors, and an overhaul of several collections/subcollections.

Changes to MSigDB Gene Symbol Mapping Procedure

Overhaul of externally sourced collections

Overhaul of C1 collection - positional gene sets

Overhaul of C2:CP:Reactome collection - Reactome gene sets

  • Reactome: Reactome V69 (+825 gene sets).

    To limit redundancy between gene sets within the Reactome subcollection, we applied a filtering procedure based on Jaccard's coefficients and distance from the hierarchy root. This procedure was previously described for limiting redundancy within the C5:GO collections (see also below).

    Revisions to C2:CGP collection

    Deprecated Gene Sets (net: -131 gene sets).

    • Gene sets derived from the Signal Transduction Knowledge Environment have been removed from MSigDB (-27 gene sets). The underlying data for this resource are no longer available in a form that would allow the collection to be reliably maintained. Previous releases of MSigDB contained gene sets derived from this resource that were founder sets for one or more gene sets in the MSigDB Hallmark collection. These deprecated C2 sets are included in MSigDB 7.0 as an ARCHIVED collection in order to preserve links to their pages from the hallmark gene set pages.
    • Gene sets with gene annotations derived from UniGene cluster identifiers have been retired and are no longer present in MSigDB 7.0 (-140 gene sets). The UniGene database was retired by NCBI as of July 2019. This change affects only gene sets where UniGene cluster identifiers were present in the gene set's original ids annotation.

    Overhaul of C5 collection - gene ontology

    Gene sets in this collection are derived from the controlled vocabulary of the Gene Ontology (GO) project: The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology (Nature Genet 2000). The gene sets are named by GO term and contain genes annotated by that term. We have replaced the entire collection with new gene sets using recent GO term annotations.

    This collection is divided into three subcollections:

    • CC: GO Cellular component (+421 gene sets). Gene sets derived from the Cellular Component Ontology.
    • MF: GO Molecular function (+744 gene sets). Gene sets derived from the Molecular Function Ontology.
    • BP: GO Biological process (+2914 gene sets). Gene sets derived from the Biological Process Ontology.

    Outline of the procedure:

    All sets are based on associations of GO terms to human genes. Genes annotated with the same GO term make up the corresponding GO term gene set.

    The input files are:

    • gene2go (downloaded on February 21, 2019)
    • This file reports GO terms that have been associated with genes in NCBI Entrez Gene. It is generated by processing the gene_association file on the GO FTP site and comparing the DB_Object_ID to annotation in NCBI Entrez Gene, as also reported in gene_info.gz. The file is available here. It is a tab-delimited plain text file with one tax_id / gene_id / evidence_code per line.

    • go-basic.obo (downloaded on February 21, 2019)
    • This file contains the entire GO ontology in OBO v.1.2 format.
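
    As an illustration of how the gene2go input can be turned into per-term gene lists, here is a minimal Python sketch. It is not the code used to build MSigDB. The function name read_gene2go is hypothetical, the column order (tax_id, GeneID, GO_ID, Evidence, ...) is an assumption about the NCBI file layout, and the sample rows are toy values rather than real annotations.

```python
import csv
import io
from collections import defaultdict

HUMAN_TAX_ID = "9606"  # NCBI Taxonomy ID for Homo sapiens


def read_gene2go(handle, tax_id=HUMAN_TAX_ID):
    """Return {GO ID: set of Gene IDs} for one taxon.

    Assumes a tab-delimited layout whose first three columns are
    tax_id, GeneID, GO_ID (an assumption, check the file header).
    """
    go_to_genes = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and the commented header line.
        if not row or row[0].startswith("#"):
            continue
        if row[0] == tax_id:
            go_to_genes[row[2]].add(row[1])
    return dict(go_to_genes)


# Toy rows for illustration only; IDs are made up.
sample = io.StringIO(
    "#tax_id\tGeneID\tGO_ID\tEvidence\n"
    "9606\t101\tGO:TOY1\tIDA\n"
    "9606\t102\tGO:TOY1\tIEA\n"
    "9606\t101\tGO:TOY2\tTAS\n"
    "10090\t201\tGO:TOY1\tIDA\n"
)
go_to_genes = read_gene2go(sample)
```

    With a real download, the same function could be called on an open file handle instead of the in-memory sample. Rows for other taxa (here the toy 10090 row) are ignored.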

    This procedure has been modified from that described previously for MSigDB v5.2. First, for each GO term we obtained the corresponding human genes from the gene2go file. Next, we applied the path rule: gene products are associated with the most specific GO terms possible, and all parent terms up to the root automatically apply to the gene product. Thus, each parent GO term gene set should include all genes associated with its child GO terms. Then we removed sets with fewer than 5 or more than 2,000 Gene IDs. Finally, we resolved redundancies as follows. We computed Jaccard's coefficient for each pair of sets and marked a pair as highly similar if its Jaccard's coefficient was greater than 0.85. We then clustered highly similar sets into "chunks" using the hclust function from the R stats package according to their GO terms and applied two rounds of filtering to every "chunk". First, we kept the largest set in the "chunk" and discarded the smaller sets. This left "chunks" of highly similar sets of identical sizes, which we further pruned by preferentially keeping the more general set (i.e., the set closest to the root of the GO ontology tree).
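
    The steps above can be sketched in Python on a toy ontology. This is a simplified illustration under stated assumptions, not the MSigDB pipeline: all function and variable names are hypothetical, the term and gene names are invented, and the grouping of highly similar sets uses connected components (union-find) as a stand-in for the hclust-based clustering described in the text.

```python
from itertools import combinations


def propagate(direct, parents):
    """Path rule: copy each term's genes to all of its ancestors."""
    full = {term: set(genes) for term, genes in direct.items()}
    for term, genes in direct.items():
        stack, seen = list(parents.get(term, ())), set()
        while stack:
            anc = stack.pop()
            if anc in seen:
                continue
            seen.add(anc)
            full.setdefault(anc, set()).update(genes)
            stack.extend(parents.get(anc, ()))
    return full


def size_filter(sets, lo=5, hi=2000):
    """Drop sets with fewer than lo or more than hi genes."""
    return {t: g for t, g in sets.items() if lo <= len(g) <= hi}


def jaccard(a, b):
    """Jaccard's coefficient: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prune_redundant(sets, depth, threshold=0.85):
    """Group highly similar sets, keep one representative per group.

    Grouping by connected components is a stand-in for hclust.
    Within a group, prefer the largest set, then the smallest depth.
    """
    root = {t: t for t in sets}

    def find(t):
        while root[t] != t:
            root[t] = root[root[t]]
            t = root[t]
        return t

    for a, b in combinations(sets, 2):
        if jaccard(sets[a], sets[b]) > threshold:
            root[find(a)] = find(b)
    groups = {}
    for t in sets:
        groups.setdefault(find(t), []).append(t)
    keep = [min(m, key=lambda t: (-len(sets[t]), depth[t]))
            for m in groups.values()]
    return {t: sets[t] for t in keep}


# Toy ontology: root <- A <- B, and root <- C (names are invented).
parents = {"A": ["root"], "B": ["A"], "C": ["root"]}
depth = {"root": 0, "A": 1, "B": 2, "C": 1}
direct = {
    "B": {"g1", "g2", "g3", "g4", "g5", "g6"},
    "A": {"g7"},
    "C": {"g20", "g21", "g22"},
}
full = propagate(direct, parents)
kept = prune_redundant(size_filter(full), depth)
```

    In this toy case C is dropped by the size filter (3 genes), A and B are highly similar (Jaccard's coefficient 6/7), and the larger set A is kept alongside the root set.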

    A previous version of the C5 collection contained 864 gene sets that were founder sets for one or more gene sets in the MSigDB Hallmark collection. These deprecated C5 sets are included in MSigDB 7.0 as an ARCHIVED collection in order to preserve links to their pages from the hallmark gene set pages.