MSigDB v5.2 Release Notes
Latest revision as of 11:44, 13 October 2016
This page describes the changes made to the gene set collections for Release 5.2 of the Molecular Signatures Database (MSigDB).
Overhaul of C5 collection - gene ontology
Gene sets in this collection are derived from the controlled vocabulary of the Gene Ontology (GO) project (The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genet 2000). The gene sets are named by GO term and contain genes annotated by that term. We have replaced the entire collection with new gene sets using recent GO term annotations.
This collection is divided into three subcollections:
- CC: GO Cellular component (+584 gene sets). Gene sets derived from the Cellular Component Ontology.
- MF: GO Molecular function (+929 gene sets). Gene sets derived from the Molecular Function Ontology.
- BP: GO Biological process (+4653 gene sets). Gene sets derived from the Biological Process Ontology.
Outline of the procedure:
All sets are based on associations of GO terms to human genes. Genes annotated with the same GO term make the corresponding GO term gene set.
The input files are:
- gene2go (downloaded on May 3, 2016). This file reports GO terms that have been associated with genes in NCBI Entrez Gene. It is generated by processing the gene_association file on the GO FTP site and comparing the DB_Object_ID to annotation in NCBI Entrez Gene, as also reported in gene_info.gz. The file is available from ftp://ftp.ncbi.nih.gov/gene/DATA/. It is a tab-delimited plain text file with one tax_id / gene_id / evidence_code record per line.
- go-basic.obo (downloaded on May 3, 2016). This file contains the entire GO ontology in OBO v.1.2 format.
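As a rough illustration of how the two input files might be read, here is a minimal Python sketch. It is not the code used to build the collection. It assumes that the first three tab-separated columns of gene2go are tax_id, GeneID and GO_ID, that 9606 is the human taxonomy ID, and that only is_a lines in [Term] stanzas of the OBO file are of interest. The function names and all sample rows are hypothetical placeholders, not real annotations.

```python
from collections import defaultdict

HUMAN_TAX_ID = "9606"  # NCBI taxonomy ID for Homo sapiens


def read_gene2go(lines, tax_id=HUMAN_TAX_ID):
    """Group Entrez Gene IDs by GO term for one organism.

    Assumes tab-delimited rows whose first three columns are
    tax_id, GeneID and GO_ID; lines starting with '#' are skipped.
    """
    term_to_genes = defaultdict(set)
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if raw.startswith("#") or len(fields) < 3:
            continue
        if fields[0] == tax_id:
            term_to_genes[fields[2]].add(fields[1])
    return dict(term_to_genes)


def read_obo_parents(lines):
    """Map each term ID to the IDs named on its is_a lines.

    Only [Term] stanzas are read; other relationship types are ignored.
    """
    parents = {}
    current = None
    in_term = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id: "):
            current = line[4:]
            parents.setdefault(current, [])
        elif in_term and current and line.startswith("is_a: "):
            # Drop the trailing "! term name" comment.
            parents[current].append(line[6:].split("!")[0].strip())
    return parents


# Made-up sample data for illustration only.
GENE2GO_SAMPLE = [
    "#tax_id\tGeneID\tGO_ID\tEvidence\n",
    "9606\t1\tGO:0000001\tIEA\n",
    "9606\t2\tGO:0000001\tIDA\n",
    "9606\t2\tGO:0000002\tIDA\n",
    "10090\t3\tGO:0000001\tIEA\n",
]
OBO_SAMPLE = [
    "format-version: 1.2\n",
    "\n",
    "[Term]\n",
    "id: GO:0000001\n",
    "name: example child\n",
    "is_a: GO:0000002 ! example parent\n",
    "\n",
    "[Term]\n",
    "id: GO:0000002\n",
    "name: example parent\n",
    "\n",
    "[Typedef]\n",
    "id: part_of\n",
]

direct_sets = read_gene2go(GENE2GO_SAMPLE)
parents = read_obo_parents(OBO_SAMPLE)
```

With real files, the same functions could be given open file handles (for example the decompressed gene2go file) instead of the in-memory lists.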
First, for each GO term we collected the corresponding human genes from the gene2go file. Next, we applied the path rule: gene products are associated with the most specific GO terms possible, and all parent terms up to the root automatically apply to the gene product. Thus, each parent GO term gene set includes all genes associated with its child GO terms. Then we removed sets with fewer than 10 or more than 2,000 Entrez Gene IDs.
Finally, we resolved redundancies as follows. We computed the Jaccard coefficient for each pair of sets and marked a pair as highly similar if its Jaccard coefficient was greater than 0.85. We then grouped pairs of highly similar sets into "chunks" according to their GO terms and applied two rounds of filtering to every "chunk". The first round was computational: we kept the largest set in the "chunk" and discarded the smaller sets. This left "chunks" of highly similar pairs of sets of identical sizes, which we further pruned manually, preferring to keep the more general set (i.e., the set with the more general GO term in the ontology tree).
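The automated part of this procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the pipeline used for the release: the child-to-parents mapping, the function names and the toy term IDs and gene IDs are all hypothetical, and the notes do not say which ontology relationships were followed. The Jaccard coefficient of two sets A and B is |A ∩ B| / |A ∪ B|. Grouping into "chunks" and the manual pruning round are not shown.

```python
from itertools import combinations


def ancestors(term, parents):
    """Return every ancestor of `term` given a child -> parents mapping."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def apply_path_rule(direct_sets, parents):
    """Copy each term's directly annotated genes to all of its ancestors."""
    full = {term: set(genes) for term, genes in direct_sets.items()}
    for term, genes in direct_sets.items():
        for anc in ancestors(term, parents):
            full.setdefault(anc, set()).update(genes)
    return full


def filter_by_size(gene_sets, min_size=10, max_size=2000):
    """Keep sets with at least min_size and at most max_size genes."""
    return {t: g for t, g in gene_sets.items() if min_size <= len(g) <= max_size}


def jaccard(a, b):
    """Jaccard coefficient: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(gene_sets, threshold=0.85):
    """List pairs of terms whose gene sets exceed the similarity threshold."""
    return [
        (s, t)
        for s, t in combinations(sorted(gene_sets), 2)
        if jaccard(gene_sets[s], gene_sets[t]) > threshold
    ]


# Toy ontology and annotations (placeholders, not real GO data).
toy_parents = {"GO:B": ["GO:A"], "GO:A": ["GO:ROOT"], "GO:C": ["GO:ROOT"]}
toy_direct = {"GO:B": set(range(1, 11)), "GO:A": {11}, "GO:C": {20, 21}}

full = apply_path_rule(toy_direct, toy_parents)
kept = filter_by_size(full)
pairs = similar_pairs(kept)
```

In this toy example GO:C is dropped for having fewer than 10 genes, and GO:A and GO:B (sharing 10 of 11 genes) are the only pair flagged as highly similar. Note that comparing all pairs is quadratic in the number of sets, which may need a smarter strategy for thousands of sets.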
The previous version of the C5 collection contained 864 gene sets that were founder sets for one or more gene sets in the MSigDB Hallmark collection. These deprecated C5 sets are included in MSigDB 5.2 as an ARCHIVED collection in order to preserve links to their pages from the hallmark gene set pages.
Updates to C2:CGP - curated gene sets
We added 4 gene sets (submitted by Dr. Eppert, McGill University) and redefined the gene set MCLACHLAN_DENTAL_CARIES_DN.
Updates to C6 collection - oncogenic signatures
Corrected typing errors in the names of two gene sets.
Updates to C7 collection - immunologic signatures
Corrected typing errors in the names of 332 gene sets and fixed other errors in gene set annotations.
For more information
For complete descriptions of all collections or to download the updated gene sets, go to the Browse Collections page (http://www.broad.mit.edu/gsea/msigdb/collections.jsp). Detailed information about v5.1 MSigDB gene sets that have been renamed, deprecated or archived in the 5.2 release can be found on the wiki page Mapping_between_v5.2_and_v5.1_gene_sets.