MSigDB Collections: Details and Acknowledgments

H collection: Hallmark gene sets

We envision this collection as the starting point for your exploration of the MSigDB resource and GSEA. Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA. We refer to the original overlapping gene sets, from which a hallmark is derived, as its "founder" sets. Hallmark gene set pages provide links to the corresponding founder sets for deeper follow-up.

This collection is an initial release of 50 hallmarks which condense information from over 4,000 original overlapping gene sets from v4.0 MSigDB collections C1 through C6. In addition, hallmark gene set pages include links to the microarray data that were used to refine and validate the hallmark signatures.

To cite your use of the collection, and for further information, please refer to Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 2015 Dec 23;1(6):417-425.

C1 collection: Positional gene sets

Gene sets corresponding to each human chromosome and each cytogenetic band that has at least one gene. Cytogenetic locations were parsed from HUGO, October 2006, and UniGene, build 197. We merged the relevant annotations from these resources and derived a single cytogenetic band location for every gene symbol. These were then grouped into sets. Decimals in cytogenetic bands were ignored. For example, 5q31.1 was considered 5q31.
Therefore, genes annotated as 5q31.2 and those annotated as 5q31.3 were both placed in the same set, 5q31. When there were conflicts, the UniGene entry was used. These gene sets can be helpful in identifying effects related to chromosomal deletions or amplifications, dosage compensation, epigenetic silencing, and other regional effects.

C2 collection: Curated gene sets

Gene sets curated from various sources such as online pathway databases, the biomedical literature, and contributions from domain experts. The gene set page for each gene set lists its source. The C2 collection is divided into the following two sub-collections: Chemical and genetic perturbations (CGP) and Canonical pathways (CP).

> C2 sub-collection CGP: Chemical and genetic perturbations

Gene sets that represent expression signatures of genetic and chemical perturbations.

Most of the CGP sets came from the biomedical literature. Over the past several years, microarray studies have identified signatures of several important biological and clinical states (e.g. cancer metastasis, stem cell characteristics, drug resistance). The C2 collection makes many of these signatures, originally published as tables in a paper, available as gene sets. To do this, we compiled a list of microarray articles with published gene expression signatures and, from each article, extracted one or more gene sets from tables in the main text or supplementary information. A number of these gene sets come in pairs, xxx_UP and xxx_DN, representing genes induced and repressed, respectively, by the perturbation. Sets curated from publications include links to the PubMed citation, the exact source of the set (e.g., Table 1), and links to any corresponding raw data in the GEO or ArrayExpress repositories. When the gene set involves a genetic perturbation, the set's brief description includes a link to the gene's entry in the NCBI Entrez Gene database.
When the gene set involves a chemical perturbation, the set's brief description includes a link to the chemical's entry in the NCBI PubChem Compound database. Other CGP gene sets include:
> C2 sub-collection CP: Canonical pathways

The pathway gene sets are curated from the following online databases:
C3 collection: Motif gene sets

Gene sets representing potential targets of regulation by transcription factors or microRNAs. The sets consist of genes grouped by short sequence motifs they share in their non-protein coding regions. The motifs represent known or likely cis-regulatory elements in promoters and 3'-UTRs. These gene sets make it possible to link changes in an expression profiling experiment to a putative cis-regulatory element. The C3 collection is divided into two sub-collections: microRNA targets (MIR) and transcription factor targets (TFT).

> C3 sub-collection MIR: microRNA targets

These sets consist of genes sharing 7-nucleotide motifs in their 3' untranslated regions. Each 7-mer motif matches (is complementary to) the seed (bases 2 through 8) of a mature human microRNA (miRNA) catalogued in v7.1 of miRBase (October 2005).

> C3 sub-collection TFT: transcription factor targets

Gene sets that share upstream cis-regulatory motifs which can function as potential transcription factor binding sites. We used two approaches to generate these motif gene sets.
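For the MIR sub-collection, the seed-matching idea can be sketched in a few lines of Python. This is only an illustration of "7-mer complementary to bases 2 through 8", not the pipeline used to build the collection: the function names and the toy 3'-UTR sequences are hypothetical, and the example miRNA is the commonly cited mature let-7a sequence.

```python
# Illustrative sketch only; function names and toy UTRs are hypothetical.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}

def seed_match_motif(mature_mirna):
    """Return the 7-mer DNA motif complementary to seed bases 2-8."""
    seed = mature_mirna.upper()[1:8]  # bases 2..8 in 1-based numbering
    if len(seed) != 7:
        raise ValueError("mature miRNA sequence is shorter than 8 bases")
    # Reverse-complement so the motif reads 5'->3' on the target transcript.
    return "".join(COMPLEMENT[base] for base in reversed(seed))

def genes_with_motif(utrs, motif):
    """Return the genes whose 3'-UTR sequence contains the motif."""
    return {
        gene
        for gene, utr in utrs.items()
        if motif in utr.upper().replace("U", "T")
    }

LET7A = "UGAGGUAGUAGGUUGUAUAGUU"  # mature let-7a, used here as an example
motif = seed_match_motif(LET7A)

# Toy 3'-UTR fragments, invented for the example.
toy_utrs = {
    "GENE_A": "AAACTACCTCAAGG",
    "GENE_B": "GGGGCCCCAAAATTTT",
    "GENE_C": "ttctacctcgg",
}
targets = genes_with_motif(toy_utrs, motif)
print(motif, sorted(targets))
```

A real analysis would also need to decide how to handle alternative 3'-UTR isoforms and conservation filters, none of which are specified here.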
C4 collection: Computational gene sets

Computational gene sets defined by mining large collections of cancer-oriented microarray data. This collection is divided into two sub-collections: Cancer gene neighborhoods (CGN) and Cancer modules (CM).

> C4 sub-collection CGN: Cancer gene neighborhoods

In our GSEA paper (Subramanian, Tamayo et al. 2005, PNAS 102, 15545-15550), we mined four expression compendia for correlated gene sets, starting with a list of 380 cancer-associated genes curated from internal resources and from Brentani, Caballero et al.; Human Cancer Genome Project/Cancer Genome Anatomy Project Annotation Consortium; Human Cancer Genome Project Sequencing Consortium. The generation and utilization of a cancer-oriented representation of the human transcriptome by using expressed sequence tags. Proc Natl Acad Sci U S A. 2003 Nov 11;100(23):13418-23.

Using the profile of a given gene as a template, we ordered every other gene in the data set by its Pearson correlation coefficient with that template. We applied a cutoff of R ≥ 0.85 to extract correlated genes. Neighborhoods were calculated independently in each compendium, so a given oncogene may have up to four "types" of neighborhoods according to the correlation present in each compendium. Neighborhoods with fewer than 25 genes at this threshold were omitted, yielding the final 427 sets.
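The neighborhood procedure described above can be sketched as follows for a single compendium. This is a minimal sketch under stated assumptions, not the original implementation: the function name, the genes-by-samples matrix layout, and the toy data are hypothetical, and the template gene is assumed to be excluded from its own neighborhood.

```python
import numpy as np

def correlation_neighborhoods(expr, genes, seeds, r_cutoff=0.85, min_size=25):
    """Build correlation neighborhoods for seed genes in one compendium.

    expr  : 2-D array, one row per gene and one column per sample
    genes : gene names, in the same order as the rows of expr
    seeds : template genes (e.g. cancer-associated genes)
    """
    corr = np.corrcoef(expr)  # Pearson correlation between all pairs of rows
    index = {gene: i for i, gene in enumerate(genes)}
    neighborhoods = {}
    for seed in seeds:
        r = corr[index[seed]]
        order = np.argsort(-r)  # most correlated genes first
        members = [
            genes[j] for j in order
            if genes[j] != seed and r[j] >= r_cutoff  # assumption: seed excluded
        ]
        if len(members) >= min_size:  # omit neighborhoods that are too small
            neighborhoods[seed] = members
    return neighborhoods

# Toy compendium with 8 samples; values are invented for the example.
t = np.arange(8.0)
alternating = np.array([1.0, -1.0] * 4)
genes = ["SEED", "N1", "N2", "FAR1", "FAR2"]
expr = np.vstack([t, 2 * t + 1, t + 0.1 * alternating, alternating, -t])

# min_size is lowered so the toy data can produce a neighborhood at all.
toy_result = correlation_neighborhoods(expr, genes, ["SEED", "FAR1"], min_size=2)
print(toy_result)
```

Running the same function once per compendium and keeping the results separate would reproduce the "up to four neighborhoods per gene" behaviour described in the text.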
> C4 sub-collection CM: Cancer modules

Gene sets defined by Segal E, Friedman N, Koller D, Regev A. A module map showing conditional activity of expression modules in cancer. Nat Genet. 2004 Oct;36(10):1090-8. Briefly, the authors compiled gene sets ("modules") from a variety of resources such as KEGG, GO, and others. By mining a large compendium of cancer-related microarray data, they identified 456 such modules as significantly changed in a variety of cancer conditions. See also http://robotics.stanford.edu/~erans/cancer.

C5 collection: Gene Ontology (GO) gene sets

Gene sets in this collection are derived from Gene Ontology (GO) annotations. GO is a collaborative effort to develop and use ontologies to support biologically meaningful annotation of genes and their products. A GO annotation consists of a GO term associated with a specific reference that describes the work or analysis upon which the association between a specific GO term and gene product is based. A gene product might be associated with one or more GO terms. Each annotation also includes an evidence code to indicate how the annotation to a particular term is supported (http://geneontology.org/page/guide-go-evidence-codes).

The gene sets in the C5 collection are based on GO terms (go-basic.obo, downloaded on 3 May, 2016) and their associations to human genes (gene2go, downloaded from NCBI FTP server on 3 May, 2016). The GO terms in the collection belong to one of three GO ontologies: molecular function (MF), cellular component (CC) or biological process (BP), and the collection is divided into sub-collections accordingly. We omitted GO terms for very broad categories that would produce extremely large gene sets. GO terms that produced gene sets with fewer than 10 genes have also been omitted. We defined sets as "highly similar" if their Jaccard's coefficient was > 0.85. For each pair of highly similar sets we kept only the larger set, and repeated the procedure until all such pairs were resolved.
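The size filter and redundancy pruning described above can be sketched as follows. This is a minimal sketch rather than the code used to build C5: the function names and toy gene sets are hypothetical, and the tie-break for two equally sized sets (keep the alphabetically first name) is an assumption the text does not specify.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def prune_redundant(gene_sets, threshold=0.85, min_size=10):
    """Drop small sets, then resolve highly similar pairs by keeping the larger."""
    kept = {name: set(g) for name, g in gene_sets.items() if len(g) >= min_size}
    changed = True
    while changed:  # repeat until no highly similar pair remains
        changed = False
        for a, b in combinations(sorted(kept), 2):
            if jaccard(kept[a], kept[b]) > threshold:
                # Keep the larger set; on a tie drop the later name (assumption).
                smaller = a if len(kept[a]) < len(kept[b]) else b
                del kept[smaller]
                changed = True
                break  # the dict changed, so restart the pairwise scan
    return kept

# Toy gene sets, invented for the example.
toy_sets = {
    "GO_A": {f"G{i}" for i in range(1, 21)},      # 20 genes
    "GO_B": {f"G{i}" for i in range(1, 19)},      # 18 genes, Jaccard 0.9 with GO_A
    "GO_C": {f"G{i}" for i in range(15, 31)},     # 16 genes, low overlap with GO_A
    "GO_TINY": {f"G{i}" for i in range(40, 43)},  # 3 genes, below the size cutoff
}
pruned = prune_redundant(toy_sets)
print(sorted(pruned))
```

Restarting the scan after each removal is quadratic per pass, which is fine for a sketch; a production version would more likely precompute the similar pairs once.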
Note to GSEA users: Gene set enrichment analysis identifies gene sets consisting of co-regulated genes; GO gene sets are based on ontologies and do not necessarily comprise co-regulated genes.

C6 collection: Oncogenic signatures

Gene sets that represent signatures of cellular pathways which are often dysregulated in cancer. The majority of signatures were generated directly from microarray data from NCBI GEO or from internal unpublished profiling experiments which involved perturbation of known cancer genes. In addition, a small number of oncogenic signatures were curated from scientific publications.

C7 collection: Immunologic signatures

The immunologic signatures collection (also called ImmuneSigDB) is composed of gene sets that represent cell types, states, and perturbations within the immune system. The signatures were generated by manual curation of published studies in human and mouse immunology.

We first captured relevant microarray datasets published in the immunology literature that have raw data deposited to Gene Expression Omnibus (GEO). For each published study, the relevant comparisons were identified (e.g. WT vs. KO; pre- vs. post-treatment etc.) and brief, biologically meaningful descriptions were created. All data were processed and normalized the same way to identify the gene sets, which correspond to the top or bottom genes (FDR < 0.02 or a maximum of 200 genes) ranked by mutual information for each assigned comparison.

The immunologic signatures collection was generated as part of our collaboration with the Haining Lab at Dana-Farber Cancer Institute and the Human Immunology Project Consortium (HIPC). To cite your use of the collection, and for further information, please refer to Godec J, Tan Y, Liberzon A, Tamayo P, Bhattacharya S, Butte A, Mesirov JP, Haining WN, Compendium of Immune Signatures Identifies Conserved and Species-Specific Biology in Response to Inflammation, 2016, Immunity 44(1), 194-206.
Copyright (c) 2004-2017 Broad Institute, Inc.,
Massachusetts Institute of Technology, and Regents of the University of California.
All rights reserved.
MSigDB database v6.1 updated October 2017
GSEA/MSigDB web site v6.3 released January 2018