MSigDB v5.2 Release Notes

From GeneSetEnrichmentAnalysisWiki

Revision as of 21:49, 5 October 2016 by Liberzon (Talk | contribs)

This page describes the changes made to the gene set collections for Release 5.2 of the Molecular Signatures Database (MSigDB).


Updates to C2:CGP - curated gene sets

We added 4 gene sets (submitted by Dr. Eppert, McGill University) and redefined the gene set [[1]

Overhaul of C5 - gene ontology

Gene sets in this collection are derived from the controlled vocabulary of the Gene Ontology (GO) project: The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology (Nature Genet 2000). The gene sets are named by GO term and contain genes annotated by that term. We deprecated the entire previous collection and defined new gene sets using more recent GO term annotations.

This collection is divided into three subcollections:

  • CC: GO Cellular component (+584 gene sets). Gene sets derived from the Cellular Component Ontology.
  • MF: GO Molecular function (+929 gene sets). Gene sets derived from the Molecular Function Ontology.
  • BP: GO Biological process (+4653 gene sets). Gene sets derived from the Biological Process Ontology.

Outline of the procedure

All sets are based on associations of GO terms to human genes. Genes annotated with the same GO term make the corresponding GO term gene set.

The input files are:

  • gene2go (downloaded on May 3, 2016)
  • This file reports GO terms that have been associated with genes in NCBI Entrez Gene. It is generated by processing the gene_association file on the GO FTP site and comparing the DB_Object_ID to annotation in NCBI Entrez Gene, as also reported in gene_info.gz. The file is available here. It is a tab delimited plain text file with one tax_id / gene_id / evidence_code per line.

  • go-basic.obo (downloaded on May 3, 2016)
  • This file contains the entire GO ontology in OBO v.1.2 format [2]. The file is produced by the Gene Ontology Consortium and is updated every 30 minutes [3]. Monthly releases are also available [4]. OBO is the plain text file format used by OBO-Edit, the open source, platform-independent application for viewing and editing ontologies.
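As a rough illustration of how these two inputs can be read, the following Python sketch builds a GO term to gene mapping and a child to parent mapping. It is not the code used to build the collection. It assumes that the first three tab-separated columns of gene2go are tax_id, GeneID and GO_ID, that human records carry tax_id 9606, that OBO stanzas are separated by blank lines, and that only is_a links are followed. The function names and the sample records are hypothetical.

```python
import io
from collections import defaultdict

HUMAN_TAX_ID = "9606"  # NCBI Taxonomy ID for Homo sapiens


def read_gene2go(handle, tax_id=HUMAN_TAX_ID):
    """Map each GO ID to the set of Entrez Gene IDs annotated with it."""
    term_to_genes = defaultdict(set)
    for line in handle:
        # Skip the header line (assumed to start with '#') and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumed column order: tax_id, GeneID, GO_ID, ...
        if len(fields) >= 3 and fields[0] == tax_id:
            term_to_genes[fields[2]].add(fields[1])
    return dict(term_to_genes)


def read_obo_parents(text):
    """Map each non-obsolete GO term to the set of its direct is_a parents."""
    parents = {}
    for stanza in text.split("\n\n"):
        lines = stanza.strip().splitlines()
        if not lines or lines[0] != "[Term]":
            continue  # header block, [Typedef] stanzas, empty chunks
        term_id, term_parents, obsolete = None, set(), False
        for line in lines[1:]:
            if line.startswith("id: "):
                term_id = line[4:].strip()
            elif line.startswith("is_a: "):
                # "is_a: GO:0000001 ! name" -> "GO:0000001"
                term_parents.add(line[6:].split("!")[0].strip())
            elif line.strip() == "is_obsolete: true":
                obsolete = True
        if term_id and not obsolete:
            parents[term_id] = term_parents
    return parents


# Made-up sample records for illustration only, not real annotations.
GENE2GO_SAMPLE = (
    "#tax_id\tGeneID\tGO_ID\tEvidence\n"
    "9606\t101\tGO:0000002\tIEA\n"
    "9606\t102\tGO:0000002\tIDA\n"
    "10090\t201\tGO:0000002\tIEA\n"
)
OBO_SAMPLE = """[Term]
id: GO:0000001
name: toy parent

[Term]
id: GO:0000002
name: toy child
is_a: GO:0000001 ! toy parent

[Term]
id: GO:0000003
name: toy obsolete term
is_obsolete: true
"""

term_to_genes = read_gene2go(io.StringIO(GENE2GO_SAMPLE))
parents = read_obo_parents(OBO_SAMPLE)
```

With the sample data, only the two human records contribute genes, and the obsolete term is left out of the parent map.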

First, for each GO term we retrieved the corresponding human genes from the gene2go file. Next, we applied the path rule: gene products are associated with the most specific GO terms possible, and all parent terms up to the root automatically apply to the gene product. Thus, each parent GO term gene set includes all genes associated with its child GO terms. We then removed sets with fewer than 10 Entrez Gene IDs, and removed very large sets for extremely broad GO terms by manually inspecting large sets corresponding to GO terms at depths 2 and 3. Finally, we resolved redundancies by applying the following rules:

  • a set equals parental GO term set: keep the parent, discard the child GO term
  • a set equals many parental GO term sets: keep the child, discard the parent GO terms
  • equal sets are siblings: discard both
  • equal sets do not have a child-parent relationship: discard both
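The propagation, size filter and redundancy rules can be sketched in Python as follows. This is one possible reading of the procedure, not the code used for the release: it compares each set only with its direct parents, treats a term with exactly one equal parent as the first rule and a term with several equal parents as the second, and does not model the manual inspection of very large sets. All function names and the toy ontology are hypothetical.

```python
from collections import defaultdict


def propagate(term_to_genes, parents):
    """Path rule: add each term's genes to all of its ancestors."""
    full = {term: set(genes) for term, genes in term_to_genes.items()}
    for term, genes in term_to_genes.items():
        stack, seen = list(parents.get(term, ())), set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            full.setdefault(ancestor, set()).update(genes)
            stack.extend(parents.get(ancestor, ()))
    return full


def drop_small(sets, min_size=10):
    """Remove sets with fewer than min_size gene IDs."""
    return {t: g for t, g in sets.items() if len(g) >= min_size}


def resolve_redundancy(sets, parents):
    """Apply the four redundancy rules to groups of identical sets."""
    groups = defaultdict(list)
    for term, genes in sets.items():
        groups[frozenset(genes)].append(term)
    discard = set()
    for members in groups.values():
        if len(members) < 2:
            continue  # unique set, nothing to resolve
        group, linked = set(members), set()
        for term in members:
            equal_parents = set(parents.get(term, ())) & group
            if not equal_parents:
                continue
            linked.add(term)
            linked |= equal_parents
            if len(equal_parents) == 1:
                discard.add(term)         # keep the parent, drop the child
            else:
                discard |= equal_parents  # keep the child, drop the parents
        discard |= group - linked         # siblings or unrelated equal sets
    return {t: g for t, g in sets.items() if t not in discard}


# Toy ontology: A is a child of root; B, C and D are children of A.
toy_parents = {"root": set(), "A": {"root"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}
toy_genes = {"B": {"1", "2"}, "C": {"3"}, "D": {"3"}}

full = propagate(toy_genes, toy_parents)
kept = resolve_redundancy(full, toy_parents)
```

In the toy example, A ends up with the same genes as root and is dropped in favor of its parent, while the equal siblings C and D are both dropped. The example skips drop_small because the toy sets are far below the threshold of 10.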

Among the deprecated C5 gene sets, 864 were founders of hallmark signatures. We designated these vintage founder sets as a hidden ARCHIVED collection and included them in MSigDB v5.2 in order to preserve links to their pages from hallmark pages.

Updates to C6 collection

Corrected typing errors in the names of two gene sets.

Updates to C7 collection

Corrected typing errors in the names of 332 gene sets and fixed other errors in gene set annotations.

For more information

For complete descriptions of all collections or to download the updated gene sets, go to the Browse Collections page. Detailed information about v5.1 MSigDB gene sets that have been renamed, deprecated or archived in the 5.2 release can be found here.
