Difference between revisions of "MSigDB collections"

From GeneSetEnrichmentAnalysisWiki
[http://www.broadinstitute.org/gsea/ GSEA Home] |
[http://www.broadinstitute.org/gsea/downloads.jsp Downloads] |  
[http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] |  
[http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |
[http://www.broadinstitute.org/gsea/contact.jsp Contact]<br>
<br>
 
<br>
 
<p>This page provides detailed descriptions of all collections of gene sets in MSigDB.</p>
 
<p>For changes and other information specific to a particular release of MSigDB, please refer to the corresponding [[Release_Notes]].</p>
 
<h2>H: Hallmarks</h2>
 
<p>some text</p>
 
<h2>C1: positional gene sets</h2>
 
<p>Genes from the same genomic location (chromosome or cytogenetic band) are grouped in a gene set. Cytogenetic annotations are from three sources:</p>
 
<ol>
 
<li>[http://www.genenames.org Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC)]</li>
 
<li>[http://www.ncbi.nlm.nih.gov/unigene/ UniGene]</li>
 
<li>[http://www.affymetrix.com Affymetrix] microarray annotations</li>
 
</ol>
 
<p>We merged the relevant annotations from these resources and derived a single cytogenetic band location for every gene symbol. These were then grouped into sets. Decimals in cytogenetic bands were ignored. For example, 5q31.1 was considered 5q31. Therefore, genes annotated as 5q31.2 and those annotated as 5q31.3 were both placed in the same set, 5q31.</p>
 
<p>When there were conflicts, the UniGene entry was used.</p>
 
<p>These sets are helpful in identifying effects related to chromosomal deletions or amplifications, dosage compensation, epigenetic silencing, and other regional effects.</p>
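<p>The band merging and conflict rule described above can be sketched roughly as follows. This is an illustration only, not the MSigDB curation code: the function names, the input tuple layout, and the gene symbols in the usage note are hypothetical, and it assumes conflicts are judged after the decimals have been dropped, which the text does not specify.</p>

```python
from collections import defaultdict

def collapse_band(band):
    """Drop the sub-band decimal, e.g. '5q31.2' -> '5q31'."""
    return band.split(".", 1)[0]

def build_positional_sets(annotations, preferred="UniGene"):
    """Group gene symbols by collapsed cytogenetic band.

    annotations: iterable of (gene_symbol, source, band) tuples
    (a hypothetical layout chosen for this sketch).
    """
    by_gene = defaultdict(dict)
    for symbol, source, band in annotations:
        by_gene[symbol][source] = collapse_band(band)

    sets = defaultdict(set)
    for symbol, bands in by_gene.items():
        distinct = set(bands.values())
        if len(distinct) == 1:
            band = distinct.pop()       # all sources agree
        elif preferred in bands:
            band = bands[preferred]     # conflict: use the UniGene entry
        else:
            continue                    # assumption: skip unresolved conflicts
        sets[band].add(symbol)
    return dict(sets)
```

<p>For example, with made-up symbols, a gene annotated as 5q31.1 by HGNC and 5q31.2 by UniGene lands in the 5q31 set without any conflict, while a gene annotated as 7p15 by HGNC and 7p14.1 by UniGene is placed in 7p14.</p>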
 
  
 
<h2>C2: curated gene sets</h2>

<p>Gene sets collected from various sources such as online pathway databases, scientific publications and personal contributions from domain experts. See the [http://software.broadinstitute.org/gsea/msigdb/collections.jsp MSigDB Collections page] on the main website.</p>
<h3>CGP: chemical and genetic perturbations</h3>
 
<ul>
 
<li>Sets curated from biomedical literature
 
<p>Over the past few years, microarray studies have identified signatures of several important biological and clinical states (e.g. cancer metastasis, stem cell characteristics, drug resistance). These gene sets are valuable biological results. Unfortunately, because gene sets are typically published as tables in a paper, the important biological findings they represent are not easily accessible to computational tools. Our first goal was to convert published gene sets into an electronic form. Towards this we compiled a list of microarray articles with published gene expression signatures. From each article, we extracted one or more gene sets from tables in the main text or supplementary information. Notably, our focus was on capturing the identity (e.g. gene symbol, GenBank accession) of all members in a gene set rather than on relationships between individual genes.</p>
 
<p>A number of these gene sets come in pairs: an xxx_UP set representing genes induced by the perturbation and an xxx_DN set representing genes repressed by it.</p>
 
<p>Names of CGP sets start with the last name of the first author of the source publication. The majority of CGP sets were curated from publications and include links to the [http://www.ncbi.nlm.nih.gov/pubmed PubMed] citation, the exact source of the set (e.g., Table 1) and links to the corresponding raw data in the [http://www.ncbi.nlm.nih.gov/geo/ GEO] or [http://www.ebi.ac.uk/arrayexpress/ ArrayExpress] repositories. When the set involves a genetic perturbation, the brief description includes a link to the gene's entry in the NCBI [http://www.ncbi.nlm.nih.gov/gene Entrez Gene] database. When the set involves a chemical perturbation, the brief description includes a link to the chemical's entry in the NCBI PubChem Compound database.</p>
 
  <ul>
 
        <li>curated by the MSigDB curation team</li>
 
        <li>contributed by the [http://depts.washington.edu/l2l/ L2L] database

<p>These sets came from the [http://depts.washington.edu/l2l/ L2L] database of published microarray gene expression data (described in [http://genomebiology.com/2005/6/9/R81 Newman and Weiner]) and were kindly shared with MSigDB. These sets list John Newman as the contributor.</p>

        </li>
 
  </ul>
 
</li>
 
<li>Sets contributed by Dr. Chi Dang from the [http://www.myccancergene.org/site/mycTargetDB.asp MYC Target Gene Database].</li>
 
<li>Individual gene set compilations contributed by MSigDB collaborators. These sets usually are not based on any specific publication.</li>
 
</ul>
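<p>The xxx_UP / xxx_DN naming convention for CGP sets makes it easy to find paired signatures programmatically. A minimal sketch, assuming only that paired sets share a prefix and end in _UP or _DN; the function name and the set names in the usage note are hypothetical.</p>

```python
def pair_up_down(set_names):
    """Return {prefix: {"UP": name, "DN": name}} for complete pairs only."""
    pairs = {}
    for name in set_names:
        for suffix in ("_UP", "_DN"):
            if name.endswith(suffix):
                prefix = name[: -len(suffix)]
                pairs.setdefault(prefix, {})[suffix[1:]] = name
    # Drop prefixes that have only one direction.
    return {prefix: p for prefix, p in pairs.items() if len(p) == 2}
```

<p>Given the made-up names SMITH_DRUG_RESPONSE_UP, SMITH_DRUG_RESPONSE_DN and JONES_STEM_CELL_UP, only SMITH_DRUG_RESPONSE is reported as a pair.</p>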
 
 
 
<h3>CP: canonical pathways</h3>
 
<p>Gene sets from pathway databases. Usually, these gene sets are canonical representations of a biological process compiled by domain experts. They were either extracted from the online sources by the MSigDB curation team or contributed by the teams of the pathway databases in collaboration with the MSigDB curation team.</p>
 
<ul>
 
<li>[http://www.biocarta.com/ BioCarta]</li>
 
<li>[http://www.genome.jp/kegg KEGG]</li>
 
<li>[http://pid.nci.nih.gov/ Pathway Interaction Database (PID)]</li>
 
<li>[http://www.reactome.org/ Reactome]</li>
 
<li>[http://www.sigmaaldrich.com/life-science.html SigmaAldrich]</li>
 
<li>[http://www.signaling-gateway.org/ Signaling Gateway]</li>
 
<li>[http://stke.sciencemag.org/ Signal Transduction Knowledge Environment (STKE)]</li>
 
<li>[http://www.superarray.com/ SuperArray]</li>
 
</ul>
 
 
 
<h2>C3: motif gene sets</h2>
 
<p>Gene sets group genes by shared <i>cis</i>-regulatory motifs. The motifs are catalogued in [http://www.nature.com/nature/journal/v434/n7031/abs/nature03441.html Xie et al.] and represent known or putative conserved regulatory elements in promoters and 3’-UTRs. These sets make it possible to link changes in a genomic experiment to conserved, putative <i>cis</i>-regulatory elements.</p>
 
<h3>TFT: transcription factor targets</h3>
 
<p>Genes in these sets share upstream <i>cis</i>-regulatory motifs that can function as potential transcription factor binding sites. We used two approaches to generate these gene sets.</p>
 
<p>Motif gene sets of ‘conserved instances’ consist of the inferred target genes for each motif <strong>m</strong> of 174 upstream motifs highly conserved among four mammalian species (<i>H. sapiens</i>, <i>M. musculus</i>, <i>R. norvegicus</i> and <i>C. lupus familiaris</i>).  The motifs are catalogued in [http://www.ncbi.nlm.nih.gov/pubmed/15735639 Xie, et al. (2005, Nature 434, 338–345)] and represent potential transcription factor binding sites. Each motif gene set consists of all human genes whose promoters (defined as regions -2kb to +2kb around the transcription start site) contained at least one conserved instance of motif <strong>m</strong>.  If the motif’s sequence matched a transcription factor binding site documented in the [http://www.gene-regulation.com/ TRANSFAC] database (v7.4), then we appended the name of the TRANSFAC binding matrix to the motif sequence in the gene set name, e.g.: MOTIFSEQ_FOO, where MOTIFSEQ is the sequence of motif <strong>m</strong> and FOO is the TRANSFAC matrix name (e.g., V$MIF1_01).  The set’s full description in this case is the TRANSFAC entry for the matching matrix.  If the motif’s sequence matched no transcription factor binding site from TRANSFAC v7.4, then we named the set MOTIFSEQ_UNKNOWN, where MOTIFSEQ is the consensus sequence of motif <strong>m</strong>.</p>
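<p>The naming rule for conserved-instance motif sets can be summarized in a few lines. This is a sketch of the rule as stated above, not the original curation script; the function name is hypothetical and the motif string is the MOTIFSEQ placeholder from the text rather than a real motif.</p>

```python
def motif_set_name(motif_seq, transfac_matrix=None):
    """Build a set name of the form MOTIFSEQ_FOO or MOTIFSEQ_UNKNOWN.

    transfac_matrix is the name of the matching TRANSFAC matrix,
    or None when the motif matched no documented binding site.
    """
    suffix = transfac_matrix if transfac_matrix else "UNKNOWN"
    return f"{motif_seq}_{suffix}"
```

<p>Calling <code>motif_set_name("MOTIFSEQ", "V$MIF1_01")</code> gives MOTIFSEQ_V$MIF1_01, and omitting the matrix name gives MOTIFSEQ_UNKNOWN.</p>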
 
<p>We also extracted 460 mammalian transcriptional regulatory motifs from the [http://www.gene-regulation.com/ TRANSFAC] database (v7.4).  We then generated motif gene sets consisting of the inferred target genes for each motif.  Every such set consists of human genes whose promoters (defined as regions -2kb to +2kb around the transcription start site) contain at least one instance of the motif.  We named these sets by the corresponding TRANSFAC matrix identifiers, e.g., V$MIF1_01.  The set’s full description is the TRANSFAC entry for the matching matrix, in a format described [http://www.gene-regulation.com/pub/databases/transfac/doc/matrix1SM.html here].</p>
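<p>To make the promoter definition concrete, the sketch below collects the genes that have at least one motif instance within -2kb to +2kb of the transcription start site (TSS). It is an illustration under stated assumptions: the data layout, function names, gene symbols and coordinates are hypothetical, motif instances are reduced to single positions, and the motif scanning and conservation filtering steps are not shown.</p>

```python
def in_promoter(tss, strand, site_pos, upstream=2000, downstream=2000):
    """True if site_pos lies within [-upstream, +downstream] of the TSS."""
    # Offsets are measured in the direction of transcription.
    offset = (site_pos - tss) if strand == "+" else (tss - site_pos)
    return -upstream <= offset <= downstream

def motif_targets(genes, sites):
    """Return symbols of genes with at least one motif site in the promoter.

    genes: {symbol: (chromosome, tss, strand)}
    sites: list of (chromosome, position) motif instances
    """
    targets = set()
    for symbol, (chrom, tss, strand) in genes.items():
        if any(c == chrom and in_promoter(tss, strand, p) for c, p in sites):
            targets.add(symbol)
    return targets
```

<p>With a symmetric window the strand does not change the result, but keeping it explicit makes the function usable with asymmetric promoter definitions as well.</p>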
 
<h3>MIR: microRNA targets</h3>
 
<p>These gene sets consist of the inferred target genes for each motif <strong>m</strong> of 221 3’-UTR motifs highly conserved among four mammalian species (<i>H. sapiens</i>, <i>M. musculus</i>, <i>R. norvegicus</i> and <i>C. lupus familiaris</i>).  The motifs are catalogued in [http://www.ncbi.nlm.nih.gov/pubmed/15735639 Xie, et al. (2005, Nature 434, 338–345)] and represent potential microRNA binding sites.  Each motif gene set consists of all genes whose 3’-UTR contained at least one conserved instance of motif <strong>m</strong>.</p>
 
 
 
<h2>C4: computational gene sets</h2>
 
<p>Gene sets defined by mining large collections of cancer-oriented microarray data.</p>
 
<h3>CGN: cancer gene neighborhoods</h3>
 
<p>These sets are defined by expression neighborhoods centered on cancer-related genes. This collection was originally described in [http://www.pnas.org/content/102/43/15545.abstract Subramanian, Tamayo et al. 2005].</p>
 
 
 
<p>Starting with a list of 380 cancer-associated genes curated from internal resources and a published cancer gene database [REF], [http://www.pnas.org/content/102/43/15545.abstract Subramanian, Tamayo et al. 2005] mined four expression compendia for correlated gene sets. Using the profile of a given gene as a template, they ordered every other gene in the data set by its Pearson correlation coefficient with that template. A cutoff of <i>R</i> &ge; 0.85 was then applied to extract correlated genes. Neighborhoods were calculated independently in each data set, so a given oncogene may have up to four "types" of neighborhoods, one for each compendium. Neighborhoods with &lt; 25 genes at this threshold were omitted, yielding the final 427 sets.</p>
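<p>The neighborhood procedure can be sketched as follows: correlate every gene with the seed gene's profile, keep genes whose Pearson correlation is at or above the cutoff, and discard neighborhoods that are too small. This is an illustrative sketch, not the published implementation. The function names and the toy profiles in the usage note are hypothetical, and excluding the seed gene from its own neighborhood is an assumption.</p>

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return cov / (sx * sy)

def expression_neighborhood(profiles, seed, r_cutoff=0.85, min_size=25):
    """Return genes correlated with the seed, or None if too few pass.

    profiles: {gene_symbol: list of expression values across samples}
    """
    template = profiles[seed]
    # Order every other gene by its correlation with the template.
    ranked = sorted(
        ((pearson(template, p), g) for g, p in profiles.items() if g != seed),
        reverse=True,
    )
    hits = [g for r, g in ranked if r >= r_cutoff]
    return hits if len(hits) >= min_size else None
```

<p>Running this once per compendium for the same seed gene would give the up to four neighborhoods per gene mentioned above.</p>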
 
<p>The names of these gene sets start with a code indicating the corresponding expression compendium followed by the symbol of the cancer associated gene.</p>
 
<p>The compendia and their codes are listed below:</p>
 
<ul>
 
<li><strong>GNF2</strong> [http://www.pnas.org/content/101/16/6062.abstract Novartis normal tissue compendium]</li>
 
<li><strong>CAR</strong> [http://cancerres.aacrjournals.org/content/61/20/7388.abstract Novartis carcinoma compendium]</li>
 
<li><strong>GCM</strong> [http://www.pnas.org/content/98/26/15149.abstract Global cancer map]</li>
 
<li><strong>MORF</strong> [http://www.pnas.org/content/102/43/15545.abstract GSEA PNAS 2005]

<p>An internal compendium of gene expression data sets, including many of the Broad Institute Cancer Program's in-house Affymetrix HG-U95 cancer samples (1,693 in all) from a variety of cancer projects representing many different tissue types, mainly primary tumors, such as prostate, breast, lung, lymphoma, leukemia, etc.</p>

</li>
 
</ul>
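<p>Because each CGN set name starts with a compendium code followed by a gene symbol, the two parts can be separated with a simple split. A minimal sketch, assuming the code and the symbol are joined by the first underscore; the function name and the example symbol are hypothetical.</p>

```python
COMPENDIA = {
    "GNF2": "Novartis normal tissue compendium",
    "CAR": "Novartis carcinoma compendium",
    "GCM": "Global cancer map",
    "MORF": "Broad Institute internal compendium (GSEA PNAS 2005)",
}

def split_cgn_name(name):
    """Split a CGN set name into (compendium_code, gene_symbol)."""
    code, _, symbol = name.partition("_")
    if code not in COMPENDIA or not symbol:
        raise ValueError(f"not a recognized CGN set name: {name!r}")
    return code, symbol
```

<p>For a made-up name such as GNF2_GENE1 this returns the pair ("GNF2", "GENE1"); names with an unknown prefix raise an error.</p>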
 
 
 
<h3>CM: cancer modules</h3>
 
<p>Gene sets are <i>identical</i> to the modules described in [http://www.nature.com/ng/journal/v36/n10/abs/ng1434.html Segal et al.]. Starting with 2,849 gene sets from a variety of resources such as KEGG, Gene Ontology, and others, the authors mined a large compendium of cancer-related microarray data and identified 456 transcriptionally co-regulated modules. Two sets with fewer than 10 NCBI human Entrez Gene IDs have been deprecated since MSigDB v3.0.</p>
 
<p>The names of these sets start with MODULE_ followed by the module number according to the contributor's notes. Gene set pages contain external links to further details about these sets.</p>
 
 
 
<h2>C5: GO gene sets</h2>
 
<p>Gene sets in this collection are derived from the controlled vocabulary of the [http://www.geneontology.org Gene Ontology (GO)] project (The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genet. (2000) 25: 25-29). The gene sets are named by GO term and contain genes annotated by that term.</p>
 
<p>This collection is divided into three subcollections:</p>
 
<ul>
 
    <li><strong>CC</strong>: GO Cellular component (+233 gene sets). Gene sets derived from the Cellular Component Ontology.</li>
 
    <li><strong>MF</strong>: GO Molecular function (+396 gene sets). Gene sets derived from the Molecular Function Ontology.</li>
 
    <li><strong>BP</strong>: GO Biological process (+825 gene sets). Gene sets derived from the Biological Process Ontology.</li>
 
</ul>
 
 
 
<h2>C6: oncogenic signatures</h2>
 
<p>Gene sets represent signatures of cellular pathways that are often dysregulated in cancer. The majority of signatures were generated directly from microarray data from NCBI GEO or from in-house unpublished expression profiling experiments that involved perturbation of known cancer genes. In addition, a small number of oncogenic signatures were curated from scientific publications.</p>
 
 
 
<h2>C7: immunologic signatures</h2>
 
<p>Gene sets that represent cell states and perturbations within the immune system. The signatures were generated by manual curation of published studies in human and mouse immunology. For each study, pairwise comparisons of relevant classes were made and genes were ranked by mutual information. Gene sets correspond to the top- or bottom-ranking genes (FDR &lt; 0.25 or a maximum of 200 genes) for each comparison. This resource was generated as part of the [http://www.immuneprofiling.org Human Immunology Project Consortium (HIPC)].</p>
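<p>One possible reading of the selection rule is: take genes passing the FDR threshold from each end of the ranking, and cap each set at 200 genes. The sketch below follows that reading and should be treated as an assumption rather than the HIPC pipeline. In particular, the signed <code>score</code> used to tell the two ends of the ranking apart, the function name, and the input layout are hypothetical.</p>

```python
def comparison_sets(ranked, fdr_cutoff=0.25, max_genes=200):
    """Return (up, down) gene lists for one pairwise comparison.

    ranked: list of (gene, score, fdr) tuples sorted by score,
    highest first; score > 0 means higher in the first class.
    """
    # Top of the ranking: positive scores, best first.
    up = [g for g, s, f in ranked if s > 0 and f < fdr_cutoff][:max_genes]
    # Bottom of the ranking: negative scores, most negative first.
    down = [g for g, s, f in reversed(ranked)
            if s < 0 and f < fdr_cutoff][:max_genes]
    return up, down
```

<p>If more than 200 genes pass the FDR threshold at one end, only the 200 most extreme are kept; if fewer pass, the set simply contains those that do.</p>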
 

Latest revision as of 21:02, 5 April 2017