https://software.broadinstitute.org/cancer/software/gsea/wiki/api.php?action=feedcontributions&user=Eby&feedformat=atom GeneSetEnrichmentAnalysisWiki - User contributions [en] 2024-03-28T12:21:57Z User contributions MediaWiki 1.34.4 https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4555 MSigDB SQLite Database 2023-04-28T23:49:33Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Mm) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated; we intend to remove it entirely in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. 
General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. 
The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an ''id'' primary key column, these primary key values are generated synthetically and '''will not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. 
The ''namespace_id'' will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set also includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. However, we also provide a set of external CHIP files encoding the same information, which will usually be more convenient to use.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used to extract gene sets for custom gene set files for use with GSEA and other analysis tools, as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. 
It is essentially an extension of the core ''gene_set'' table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Here are some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. The species of origin is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. The first two namespaces are recorded in the ''primary_namespace_id'' and ''secondary_namespace_id'' columns, and the number of namespaces in ''num_namespaces''. For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. 
For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;The ''publication'' and ''author'' tables associate publication info to gene sets (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. 
Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The &quot;external item&quot; (gray) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;When mining external resources for gene sets, e.g., Reactome, Gene Ontology, Human Phenotype Ontology, we sometimes find that the resulting collection would contain multiple gene sets that are too similar if we include them all. We apply a redundancy filtering procedure and select a single representative of similar candidate gene sets and exclude the others. MSigDB’s online gene set page of a selected gene set includes information about any related candidate gene sets that were excluded, linking out to details on the external resource’s website. The gray tables ''external_term'' and ''external_term_filtered_by_similarity'' contain this information. &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;Example Queries&lt;/h2&gt;<br /> &lt;p&gt;<br /> The examples given here assume we are working with the MSigDB Human database from our [https://www.gsea-msigdb.org/gsea/downloads.jsp Downloads] page (msigdb_v2023.1.Hs.db is the current version at the time of this writing). Note that we ZIP the database to reduce its size, so you must decompress it first before use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> These examples also assume the use of the [https://sqlite.org/cli.html official SQLite command line shell] to keep everything consistent across all platforms. 
The exact results may vary depending on the version of the database you are using and the particular query.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets in the GMT format&lt;/h3&gt;<br /> &lt;p&gt;<br /> One key use-case for performing SQL queries against the database involves building custom collections of gene sets, so such queries have been designed to be fast and convenient. For example, the following will select all the WikiPathways sets in the Human database into a GMT file named wikipathways.gmt:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways.gmt<br /> SELECT standard_name, 'na', group_concat(symbol, char(9))<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The basic template for creating GMTs is as follows:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once &lt;filename&gt;<br /> SELECT standard_name, 'na', group_concat(symbol, char(9))<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE &lt;selection criteria&gt;<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Simply vary the criteria in the WHERE clause to determine the contents of the output GMT. The first two lines are SQLite specific directives (fill in the desired file name on line 2). 
The constant 'na' fills the GMT description column. Note that the second argument to the ''group_concat'' function must be a tab character, written here as char(9); a quoted literal tab works equally well.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets containing one or more specified genes&lt;/h3&gt;<br /> &lt;p&gt;<br /> Here's another simple example that finds the names of all gene sets which have BRCA1 or BRCA2 as a member:<br /> &lt;pre&gt;<br /> SELECT distinct(standard_name)<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') ORDER BY standard_name;<br /> <br /> AAAYWAACM_HFH4_01<br /> ACTAYRNNNCCCR_UNKNOWN<br /> ACTGAAA_MIR30A3P_MIR30E3P<br /> ARID3B_TARGET_GENES<br /> ASH1L_TARGET_GENES<br /> &lt;...etc...&gt;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets and their metadata&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query gets all the WikiPathways sets after applying a size threshold of between 15 and 500 genes. Here we are also providing a full link to the gene set on the GSEA-MSigDB website in place of the ‘na’ of the earlier example:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways_threshold.gmt<br /> SELECT standard_name,<br /> ( SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs' )<br /> ||'/'||standard_name,<br /> group_concat(symbol, char(9))<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that here we are using a subquery to get the MSigDB_base_URL to build the website link:<br /> &lt;pre&gt;<br /> SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs'<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This next query builds on 
our earlier example combined with the above to get all sets with either BRCA1 or BRCA2 as a member in that size range and save them to a GMT:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once BRCA1_BRCA2_sets.gmt<br /> SELECT standard_name,<br /> (SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs')<br /> ||'/'||standard_name,<br /> group_concat(symbol, char(9))<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE gset.id IN<br /> ( SELECT distinct(gene_set_id)<br /> FROM gene_set_gene_symbol gsgs2<br /> INNER JOIN gene_symbol gsym2 ON gsym2.id = gsgs2.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') )<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This query gets some more detailed information about a particular named gene set, including the PubMed ID:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .headers on<br /> SELECT collection_name, license_code, PMID AS PubMedID, description_brief<br /> FROM gene_set gset<br /> INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id<br /> INNER JOIN publication pub ON pub.id = publication_id<br /> WHERE standard_name = 'ZHOU_CELL_CYCLE_GENES_IN_IR_RESPONSE_6HR';<br /> <br /> collection_name license_code PubMedID description_brief<br /> C2:CGP CC-BY-4.0 17404513 Cell cycle genes significantly (p =&lt; 0.05) changed in fibroblast cells at 6 h after exposure to ionizing radiation.<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Now, get the Title and Authors for the PubMed ID from the above:<br /> &lt;pre&gt;<br /> SELECT title, group_concat(display_name) AS Authors<br /> FROM publication pub<br /> INNER JOIN publication_author pa ON publication_id = pub.id<br /> INNER JOIN author au ON author_id = au.id<br /> WHERE PMID = 17404513;<br /> <br /> title Authors<br /> 
Identification of primary transcriptional regulation of cell cycle-regulated genes upon DNA damage. Zhou T,Chou J,Mullen TE,Elkon R,Zhou Y,Simpson DA,Bushel PR,Paules RS,Lobenhofer EK,Hurban P,Kaufmann WK<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This query will find the External Term(s) and Name(s) that were filtered out as similar by our redundancy check for a given GOBP gene set:<br /> &lt;pre&gt;<br /> SELECT et.term, external_name<br /> FROM external_term et<br /> INNER JOIN external_term_filtered_by_similarity etfbs ON etfbs.term = et.term<br /> INNER JOIN gene_set gset ON gset.id = etfbs.gene_set_id<br /> WHERE standard_name = 'GOBP_MITOTIC_SPINDLE_ELONGATION';<br /> <br /> term external_name<br /> GO:0051256 mitotic spindle midzone assembly<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting a summary of gene sets&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query will extract a summary of selected gene sets with a short example WHERE clause to restrict it to the C5:GO collection only. 
You can add a more detailed WHERE clause and the column selection can be expanded or reduced as desired:<br /> &lt;pre&gt;<br /> SELECT standard_name, count(gene_symbol_id), collection_name,<br /> source_species_code, ns.label, contributor, PMID<br /> FROM gene_set gset<br /> INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id<br /> INNER JOIN namespace ns ON ns.id = primary_namespace_id<br /> LEFT JOIN publication pub ON publication_id = pub.id<br /> INNER JOIN gene_set_gene_symbol gsgs ON gsgs.gene_set_id = gset.id<br /> WHERE collection_name LIKE 'C5:GO:%'<br /> GROUP BY standard_name ORDER BY standard_name limit 3;<br /> <br /> standard_name count(gene_symbol_id) collection_name source_species_code label contributor PMID<br /> GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS 6 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> GOBP_2FE_2S_CLUSTER_ASSEMBLY 11 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS 17 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets overlapping with a list of genes using Jaccard Similarity&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query computes the Jaccard Similarity between a list of HUGO gene symbols, held one per line in a text file named members.txt, and each of the gene sets in MSigDB (here is an [https://data.broadinstitute.org/gsea-msigdb/msigdb/example/members.txt example file]). 
Use MGI symbols if working with the Mouse database. The list is imported into the temp schema so that the database file itself is left unmodified. Because .import takes the column names from the first line of the file when it creates the table, the query as written expects that line to be the header symbol:<br /> &lt;pre&gt;<br /> .import --schema temp members.txt member_list<br /> .mode tabs<br /> .headers on<br /> WITH QuerySet(member) AS (SELECT symbol FROM member_list)<br /> SELECT standard_name, sum(InQuerySet) AS IntersectionCount,<br /> (sum(NotInQuerySet) + (SELECT count(member) FROM QuerySet)) AS UnionCount,<br /> CAST(sum(InQuerySet) AS REAL)/(sum(NotInQuerySet) +<br /> (SELECT count(member) FROM QuerySet)) AS JaccSim<br /> FROM ( SELECT standard_name,<br /> CASE WHEN symbol IN ( SELECT member FROM QuerySet ) <br /> THEN 1 ELSE 0 END AS InQuerySet,<br /> CASE WHEN symbol NOT IN ( SELECT member FROM QuerySet ) <br /> THEN 1 ELSE 0 END AS NotInQuerySet<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsgs.gene_symbol_id = gsym.id )<br /> GROUP BY standard_name ORDER BY JaccSim DESC LIMIT 20;<br /> <br /> standard_name IntersectionCount UnionCount JaccSim<br /> SOGA_COLORECTAL_CANCER_MYC_UP 79 170 0.464705882352941<br /> WP_PYRIMIDINE_METABOLISM 24 227 0.105726872246696<br /> KEGG_PURINE_METABOLISM 31 295 0.105084745762712<br /> KEGG_PYRIMIDINE_METABOLISM 24 241 0.0995850622406639<br /> GOBP_NUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 18 191 0.0942408376963351<br /> GOBP_RIBONUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 16 185 0.0864864864864865<br /> GOBP_NUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 19 225 0.0844444444444444<br /> REACTOME_METABOLISM_OF_NUCLEOTIDES 20 244 0.0819672131147541<br /> GOBP_RIBONUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 16 211 0.0758293838862559<br /> REACTOME_NUCLEOTIDE_BIOSYNTHESIS 11 170 0.0647058823529412<br /> GOBP_PURINE_NUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 11 178 0.0617977528089888<br /> MODULE_219 11 183 0.0601092896174863<br /> SCHUHMACHER_MYC_TARGETS_UP 14 233 0.0600858369098712<br /> GOBP_PURINE_NUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 11 201 0.054726368159204<br /> 
GSE33292_WT_VS_TCF1_KO_DN3_THYMOCYTE_DN 19 348 0.0545977011494253<br /> GOBP_NUCLEOSIDE_PHOSPHATE_BIOSYNTHETIC_PROCESS 24 440 0.0545454545454545<br /> GOBP_GMP_BIOSYNTHETIC_PROCESS 9 172 0.0523255813953488<br /> GOBP_RIBOSE_PHOSPHATE_BIOSYNTHETIC_PROCESS 20 385 0.051948051948052<br /> MODULE_102 9 177 0.0508474576271186<br /> GOBP_NUCLEOBASE_BIOSYNTHETIC_PROCESS 9 177 0.0508474576271186<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;About SQLite&lt;/h2&gt;<br /> &lt;p&gt;<br /> The official SQLite documentation is available at [https://www.sqlite.org https://www.sqlite.org] and an (unofficial) introductory tutorial is available at [https://www.sqlitetutorial.net https://www.sqlitetutorial.net]. <br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> As a single-file database format, SQLite is well suited to our needs.<br /> &lt;ul&gt;<br /> &lt;li&gt;It's self-contained (https://www.sqlite.org/about.html)<br /> &lt;ul&gt;<br /> &lt;li&gt;It's not a networked client-server DB like MySQL, PostgreSQL, etc. so there is no additional set-up, administration, or maintenance in running the database.&lt;/li&gt;<br /> &lt;li&gt;A database is held in a single file, matching the idea of a portable database analogous to our existing XML format.&lt;/li&gt;<br /> &lt;li&gt;The “engine” is a small program (~1.1 MB) which reads local files.&lt;/li&gt;<br /> &lt;li&gt;Aside from initial installation, it’s ready to use directly.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;It has a full-featured SQL implementation (https://www.sqlite.org/fullsql.html)<br /> &lt;ul&gt;&lt;li&gt;A relational model gives a better representation of MSigDB contents than XML can.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;It's very fast, especially compared to processing XML. 
The developers say it's &quot;faster than the filesystem&quot; (https://www.sqlite.org/fasterthanfs.html).&lt;/li&gt;<br /> &lt;li&gt;It’s free and Open Source (Public Domain)&lt;/li&gt;<br /> &lt;li&gt;It’s ubiquitous and widely used (https://www.sqlite.org/mostdeployed.html).&lt;/li&gt;<br /> &lt;li&gt;There are programming language bindings for Python, R, Java (JDBC), Julia, C, etc.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_v2023.1.Mm_Release_Notes&diff=4554 MSigDB v2023.1.Mm Release Notes 2023-04-06T21:19:11Z <p>Eby: </p> <hr /> <div>&lt;h1&gt;Important Notices&lt;/h1&gt;<br /> <br /> This page describes updates made to the Molecular Signatures Database Mouse Collections for release 2023.1 (MSigDB 2023.1.Mm).<br /> <br /> '''In order to access the MSigDB mouse collections through the GSEA UI, GSEA 4.3.0 or newer is required.'''<br /> <br /> MSigDB v2023.1 is based on gene annotation data from Ensembl Release 109 (Feb 2023).<br /> <br /> <br /> &lt;h1&gt;Updates to Mouse Collections (MSigDB v2023.1.Mm)&lt;/h1&gt;<br /> <br /> &lt;h2&gt;M1: positional gene sets&lt;/h2&gt;<br /> As previously noted in the [[MSigDB_v2022.1.Mm_Release_Notes]], the underlying data for the M1 collection remains based on the cytogenetic band annotations provided in the Ensembl 102 release (corresponding to the GRCm38 assembly), because cytogenetic band annotations for GRCm39 remain unavailable. Gene identifiers have, however, been updated.<br /> <br /> 
&lt;h2&gt;M2:CGP&lt;/h2&gt;<br /> <br /> 3 Gene sets contributed by MSigDB users have been added to M2:CGP<br /> &lt;ul&gt;<br /> &lt;li&gt;&lt;span class=&quot;plainlinks&quot;&gt;[https://gsea-msigdb.org/gsea/msigdb/mouse/geneset/SAUL_SEN_MAYO SAUL_SEN_MAYO]&lt;/span&gt;&lt;/li&gt;<br /> &lt;li&gt;&lt;span class=&quot;plainlinks&quot;&gt;[https://gsea-msigdb.org/gsea/msigdb/mouse/geneset/MA_RAT_AGING_UP MA_RAT_AGING_UP]&lt;/span&gt;&lt;/li&gt;<br /> &lt;li&gt;&lt;span class=&quot;plainlinks&quot;&gt;[https://gsea-msigdb.org/gsea/msigdb/mouse/geneset/MA_RAT_AGING_DN MA_RAT_AGING_DN]&lt;/span&gt;&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;M2:CP:Reactome&lt;/h2&gt;<br /> <br /> &lt;ul&gt;<br /> &lt;li&gt;Reactome gene sets have been updated to reflect the state of the Reactome pathway architecture as of '''Reactome v83''' (+2 gene sets).&lt;/li&gt;<br /> &lt;li&gt;As previously described in the [[MSigDB_v7.0_Release_Notes#C2:CP:Reactome_-_Major_overhaul | Reactome release notes for MSigDB 7.0]], in order to limit redundancy between gene sets within the Reactome sub-collection we applied a filtering procedure based on Jaccard coefficients and distance from the top level of the Reactome event hierarchy.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> <br /> &lt;h2&gt;M2:CP:WikiPathways&lt;/h2&gt;<br /> WikiPathways gene sets have been updated to the February 10, 2023 release (+XX gene sets).<br /> <br /> &lt;h2&gt;M3:GTRD&lt;/h2&gt;<br /> &lt;p&gt;GTRD data was updated to the 21.12 release (+7 gene sets).&lt;/p&gt;<br /> <br /> &lt;h2&gt;M5:GO (Gene Ontology)&lt;/h2&gt;<br /> &lt;p&gt; Gene sets in these sub-collections are derived from the controlled vocabulary of the Gene Ontology (GO) project: The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology (&lt;span class=&quot;plainlinks&quot;&gt;[http://www.geneontology.org Nature Genet 2000]&lt;/span&gt;). The gene sets are named by GO term and contain genes annotated by that term. 
This collection has been updated to the most recent GO annotations as present in the GO-basic obo file released on 2023-01-01 and NCBI gene2go annotations downloaded on 2023-02-10.&lt;/p&gt;<br /> <br /> &lt;p&gt;This collection is divided into three sub-collections:&lt;/p&gt;<br /> &lt;ul&gt;<br /> &lt;li&gt;&lt;strong&gt;BP&lt;/strong&gt;: GO Biological process (+67 gene sets). Gene sets derived from the Biological Process Ontology, which are prefixed with &quot;GOBP_&quot;.&lt;/li&gt;<br /> &lt;li&gt;&lt;strong&gt;CC&lt;/strong&gt;: GO Cellular component (-11 gene sets). Gene sets derived from the Cellular Component Ontology, which are prefixed with &quot;GOCC_&quot;.&lt;/li&gt;<br /> &lt;li&gt;&lt;strong&gt;MF&lt;/strong&gt;: GO Molecular function (+57 gene sets). Gene sets derived from the Molecular Function Ontology, which are prefixed with &quot;GOMF_&quot;.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> <br /> &lt;p&gt;These updates were generated in accordance with the procedure described in the [[MSigDB_v7.0_Release_Notes#C5_.28Gene_Ontology_collection.29_-_Major_overhaul | GO release notes for MSigDB 7.0]].&lt;/p&gt;<br /> <br /> <br /> &lt;h2&gt;M8 cell type signature gene sets&lt;/h2&gt;<br /> &lt;p&gt;Added gene sets describing uterine cell type identity signatures from &lt;span class=&quot;plainlinks&quot;&gt;[https://pubmed.ncbi.nlm.nih.gov/35669188/ Zhang, et al. 
2023 Digital Cell Atlas of Mouse Uterus: From Regenerative Stage to Maturational Stage.]&lt;/span&gt; (+18 gene sets)&lt;/p&gt;<br /> <br /> &lt;h2&gt;CHIP file updates&lt;/h2&gt;<br /> &lt;ul&gt;<br /> &lt;li&gt;MSigDB 2023.1.Mm gene annotations and gene mapping CHIP files have been updated to data from Ensembl 109.&lt;/li&gt;<br /> &lt;li&gt;Gene orthology annotations for mapping human and rat genes to their best match mouse orthologs have been updated to &lt;span class=&quot;plainlinks&quot;&gt;[https://www.alliancegenome.org/ Alliance of Genome Resources]&lt;/span&gt; orthology database release 5.3.0 (2022-10-28).&lt;/li&gt;<br /> &lt;/ul&gt;<br /> <br /> &lt;h2&gt;SQLite Database&lt;/h2&gt;<br /> &lt;p&gt;With this release we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Mm) resources. Each ships as a single-file database usable with any compliant SQLite client. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. 
See our [[MSigDB_SQLite_Database|documentation]] for more details on the contents and usage.&lt;/p&gt;<br /> &lt;p&gt;Note that we will continue producing the XML file for now, but it should be considered deprecated; we intend to remove it entirely in a future release.&lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_v2023.1.Hs_Release_Notes&diff=4553 MSigDB v2023.1.Hs Release Notes 2023-04-06T21:18:48Z <p>Eby: </p> <hr /> <div>&lt;h1&gt;Important Notices&lt;/h1&gt;<br /> <br /> This page describes updates made to the Molecular Signatures Database Human Collections for release 2023.1 (MSigDB 2023.1.Hs).<br /> <br /> '''In order to access the MSigDB mouse collections through the GSEA UI, GSEA 4.3.0 or newer is required.'''<br /> <br /> MSigDB v2023.1 is based on gene annotation data from Ensembl Release 109 (Feb 2023).<br /> <br /> <br /> &lt;h1&gt;Updates to Human Collections (MSigDB v2023.1.Hs)&lt;/h1&gt;<br /> <br /> &lt;h2&gt;C1: positional gene sets&lt;/h2&gt;<br /> Updated human gene annotations to Ensembl 109 (+1 gene set).<br /> &lt;h2&gt;C2:CGP&lt;/h2&gt;<br /> <br /> 6 Gene sets contributed by MSigDB users have been added to C2:CGP<br /> &lt;ul&gt;<br /> &lt;li&gt;&lt;span class=&quot;plainlinks&quot;&gt;[https://gsea-msigdb.org/gsea/msigdb/human/geneset/SAUL_SEN_MAYO SAUL_SEN_MAYO]&lt;/span&gt;&lt;/li&gt;<br /> &lt;li&gt;&lt;span class=&quot;plainlinks&quot;&gt;[https://gsea-msigdb.org/gsea/msigdb/human/geneset/MA_RAT_AGING_UP 
MA_RAT_AGING_UP]&lt;/span&gt;&lt;/li&gt;<br /> &lt;li&gt;&lt;span class=&quot;plainlinks&quot;&gt;[https://gsea-msigdb.org/gsea/msigdb/human/geneset/MA_RAT_AGING_DN MA_RAT_AGING_DN]&lt;/span&gt;&lt;/li&gt;<br /> &lt;li&gt;&lt;span class=&quot;plainlinks&quot;&gt;[https://gsea-msigdb.org/gsea/msigdb/human/geneset/NOURUZI_NEPC_ASCL1_TARGETS NOURUZI_NEPC_ASCL1_TARGETS]&lt;/span&gt;&lt;/li&gt;<br /> &lt;li&gt;&lt;span class=&quot;plainlinks&quot;&gt;[https://gsea-msigdb.org/gsea/msigdb/human/geneset/KOHN_EMT_EPITHELIAL KOHN_EMT_EPITHELIAL]&lt;/span&gt;&lt;/li&gt;<br /> &lt;li&gt;&lt;span class=&quot;plainlinks&quot;&gt;[https://gsea-msigdb.org/gsea/msigdb/human/geneset/KOHN_EMT_MESENCHYMAL KOHN_EMT_MESENCHYMAL]&lt;/span&gt;&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;C2:CP:Reactome&lt;/h2&gt;<br /> <br /> &lt;ul&gt;<br /> &lt;li&gt;Reactome gene sets have been updated to reflect the state of the Reactome pathway architecture as of '''Reactome v83''' (+19 gene sets).&lt;/li&gt;<br /> &lt;li&gt;As previously described in the [[MSigDB_v7.0_Release_Notes#C2:CP:Reactome_-_Major_overhaul | Reactome release notes for MSigDB 7.0]], in order to limit redundancy between gene sets within the Reactome sub-collection we applied a filtering procedure based on Jaccard coefficients and distance from the top level of the Reactome event hierarchy.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> <br /> &lt;h2&gt;C2:CP:WikiPathways&lt;/h2&gt;<br /> WikiPathways gene sets have been updated to the February 10, 2023 release (+21 gene sets).<br /> <br /> &lt;h2&gt;C3:TFT:GTRD&lt;/h2&gt;<br /> &lt;p&gt;GTRD data was updated to the 21.12 release. (-12 gene sets)&lt;/p&gt;<br /> <br /> &lt;h2&gt;C5:GO (Gene Ontology)&lt;/h2&gt;<br /> &lt;p&gt; Gene sets in these sub-collections are derived from the controlled vocabulary of the Gene Ontology (GO) project: The Gene Ontology Consortium. 
Gene Ontology: tool for the unification of biology (&lt;span class=&quot;plainlinks&quot;&gt;[http://www.geneontology.org Nature Genet 2000]&lt;/span&gt;). The gene sets are named by GO term and contain genes annotated by that term. This collection has been updated to the most recent GO annotations as present in the GO-basic obo file released on 2023-01-01 and NCBI gene2go annotations downloaded on 2023-02-10.&lt;/p&gt;<br /> <br /> &lt;p&gt;This collection is divided into three sub-collections:&lt;/p&gt;<br /> &lt;ul&gt;<br /> &lt;li&gt;&lt;strong&gt;BP&lt;/strong&gt;: GO Biological process (-12 gene sets). Gene sets derived from the Biological Process Ontology, which are prefixed with &quot;GOBP_&quot;.&lt;/li&gt;<br /> &lt;li&gt;&lt;strong&gt;CC&lt;/strong&gt;: GO Cellular component (-26 gene sets). Gene sets derived from the Cellular Component Ontology, which are prefixed with &quot;GOCC_&quot;.&lt;/li&gt;<br /> &lt;li&gt;&lt;strong&gt;MF&lt;/strong&gt;: GO Molecular function (+9 gene sets). Gene sets derived from the Molecular Function Ontology, which are prefixed with &quot;GOMF_&quot;.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> <br /> &lt;p&gt;These updates were generated in accordance with the procedure described in the [[MSigDB_v7.0_Release_Notes#C5_.28Gene_Ontology_collection.29_-_Major_overhaul | GO release notes for MSigDB 7.0.]]&lt;/p&gt;<br /> <br /> &lt;h2&gt;C5:HPO (Human Phenotype Ontology)&lt;/h2&gt;<br /> <br /> Gene sets in this sub-collection have been updated to reflect the 2023-01-27 release of the Human Phenotype Ontology database (+263 gene sets). This sub-collection has been redundancy filtered through a procedure comparable to that of the GO and Reactome sub-collections.<br /> <br /> &lt;h2&gt;C8 cell type signature gene sets&lt;/h2&gt;<br /> &lt;p&gt;Added gene sets describing lung cell type identity signatures from &lt;span class=&quot;plainlinks&quot;&gt;[https://pubmed.ncbi.nlm.nih.gov/36493756/ He P., Lim K., et al. 
2022 A human fetal lung cell atlas uncovers proximal-distal gradients of differentiation and key regulators of epithelial fates.]&lt;/span&gt; &lt;span class=&quot;plainlinks&quot;&gt;(https://lungcellatlas.org)&lt;/span&gt; (+126 gene sets)&lt;/p&gt;<br /> <br /> &lt;h2&gt;CHIP file updates&lt;/h2&gt;<br /> &lt;ul&gt;<br /> &lt;li&gt;MSigDB 2023.1.Hs gene annotations and gene mapping CHIP files have been updated to data from Ensembl 109.&lt;/li&gt;<br /> &lt;li&gt;Gene orthology annotations for mapping mouse and rat genes to their best match human orthologs have been updated to &lt;span class=&quot;plainlinks&quot;&gt;[https://www.alliancegenome.org/ Alliance of Genome Resources]&lt;/span&gt; orthology database release 5.3.0 (2022-10-28).&lt;/li&gt;<br /> &lt;/ul&gt;<br /> <br /> &lt;h2&gt;SQLite Database&lt;/h2&gt;<br /> &lt;p&gt;With this release we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Mm) resources. Each ships as a single-file database usable with any compliant SQLite client. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. 
See our [[MSigDB_SQLite_Database|documentation]] for more details on the contents and usage.&lt;/p&gt;<br /> &lt;p&gt;Note that we will continue producing the XML file for now, but it should be considered deprecated; we intend to remove it entirely in a future release.&lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4550 MSigDB SQLite Database 2023-03-24T19:16:38Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Mm) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. 
In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. 
SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an ''id'' primary key column, these primary key values are generated synthetically and '''will not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. 
The ''namespace_id'' will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. We provide a set of external CHIP files encoding the same information which will usually be more convenient to use, however.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used to extract gene sets for custom gene set files for use with GSEA and other analysis tools, as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. 
It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Here are some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the ''primary_namespace_id'', ''secondary_namespace_id'', and their count in ''num_namespaces''. For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. 
For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;The ''publication'' and ''author'' tables associate publication info to gene sets (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. 
Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The &quot;external item&quot; (gray) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;When mining external resources for gene sets, e.g., Reactome, Gene Ontology, Human Phenotype Ontology, we sometimes find that the resulting collection would contain multiple gene sets that are too similar if we include them all. We apply a redundancy filtering procedure and select a single representative of similar candidate gene sets and exclude the others. MSigDB’s online gene set page of a selected gene set includes information about any related candidate gene sets that were excluded, linking out to details on the external resource’s website. The gray tables ''external_term'' and ''external_term_filtered_by_similarity'' contain this information. &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;Example Queries&lt;/h2&gt;<br /> &lt;p&gt;<br /> The examples given here assume we are working with the MSigDB Human database from our [https://www.gsea-msigdb.org/gsea/downloads.jsp Downloads] page (msigdb_v2023.1.Hs.db is the current version at the time of this writing). Note that we ZIP the database to reduce its size, so you must decompress it first before use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> These examples also assume the use of the [https://sqlite.org/cli.html official SQLite command line shell] to keep everything consistent across all platforms. 
The exact results may vary depending on the version of the database you are using and the particular query.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets in the GMT format&lt;/h3&gt;<br /> &lt;p&gt;<br /> One key use-case for performing SQL queries against the database involves building custom collections of gene sets, so those have been designed to be fast and convenient. For example, the following will select all the WikiPathways sets in the Human database into a GMT file named wikipathways.gmt:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways.gmt<br /> SELECT standard_name 'na', group_concat(symbol, ' ')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The basic template for creating GMTs is as follows:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once &lt;filename&gt;<br /> SELECT standard_name 'na', group_concat(symbol, ' ')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE &lt;selection criteria&gt;<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Simply vary the criteria in the WHERE clause to determine the contents of the output GMT. The first two lines are SQLite specific directives (fill in the desired file name on line 2). 
Note that the second argument to the ''group_concat'' function is a quoted tab character.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets containing one or more specified genes&lt;/h3&gt;<br /> &lt;p&gt;<br /> Here's another simple example that finds the names of all gene sets which have BRCA1 or BRCA2 as a member:<br /> &lt;pre&gt;<br /> SELECT distinct(standard_name)<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') ORDER BY standard_name;<br /> <br /> AAAYWAACM_HFH4_01<br /> ACTAYRNNNCCCR_UNKNOWN<br /> ACTGAAA_MIR30A3P_MIR30E3P<br /> ARID3B_TARGET_GENES<br /> ASH1L_TARGET_GENES<br /> &lt;...etc...&gt;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets and their metadata&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query gets all the WikiPathways sets after applying a size threshold of 15 to 500 genes. Here we are also providing a full link to the gene set on the GSEA-MSigDB website in place of the ‘na’ of the earlier example:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways_threshold.gmt<br /> SELECT standard_name,<br /> ( SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs' )<br /> ||'/'||standard_name,<br /> group_concat(symbol, ' ')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that here we are using a subquery to get the MSigDB_base_URL to build the website link:<br /> &lt;pre&gt;<br /> SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs'<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This next query builds on 
our earlier example combined with the above to get all sets with either BRCA1 or BRCA2 as a member in that size range and save them to a GMT:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once BRCA1_BRCA2_sets.gmt<br /> SELECT standard_name,<br /> (SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs')<br /> ||'/'||standard_name,<br /> group_concat(symbol, ' ')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE gset.id IN<br /> ( SELECT distinct(gene_set_id)<br /> FROM gene_set_gene_symbol gsgs2<br /> INNER JOIN gene_symbol gsym2 ON gsym2.id = gsgs2.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') )<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This query gets some more detailed information about a particular named gene set, including the PubMed ID:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .headers on<br /> SELECT collection_name, license_code, PMID AS PubMedID, description_brief<br /> FROM gene_set gset<br /> INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id<br /> INNER JOIN publication pub ON pub.id = publication_id<br /> WHERE standard_name = 'ZHOU_CELL_CYCLE_GENES_IN_IR_RESPONSE_6HR';<br /> <br /> collection_name license_code PubMedID description_brief<br /> C2:CGP CC-BY-4.0 17404513 Cell cycle genes significantly (p =&lt; 0.05) changed in fibroblast cells at 6 h after exposure to ionizing radiation.<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Now, get the Title and Authors for the PubMed ID from the above:<br /> &lt;pre&gt;<br /> SELECT title, group_concat(display_name) AS Authors<br /> FROM publication pub<br /> INNER JOIN publication_author pa ON publication_id = pub.id<br /> INNER JOIN author au ON author_id = au.id<br /> WHERE PMID = 17404513;<br /> <br /> title Authors<br /> 
Identification of primary transcriptional regulation of cell cycle-regulated genes upon DNA damage. Zhou T,Chou J,Mullen TE,Elkon R,Zhou Y,Simpson DA,Bushel PR,Paules RS,Lobenhofer EK,Hurban P,Kaufmann WK<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This query will find the External Term(s) and Name(s) that were filtered out as similar by our redundancy check for a given GOBP gene set:<br /> &lt;pre&gt;<br /> SELECT et.term, external_name<br /> FROM external_term et<br /> INNER JOIN external_term_filtered_by_similarity etfbs ON etfbs.term = et.term<br /> INNER JOIN gene_set gset ON gset.id = etfbs.gene_set_id<br /> WHERE standard_name = 'GOBP_MITOTIC_SPINDLE_ELONGATION';<br /> <br /> term external_name<br /> GO:0051256 mitotic spindle midzone assembly<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting a summary of gene sets&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query will extract a summary of selected gene sets with a short example WHERE clause to restrict it to the C5:GO collection only. 
You can add a more detailed WHERE clause and the column selection can be expanded or reduced as desired:<br /> &lt;pre&gt;<br /> SELECT standard_name, count(gene_symbol_id), collection_name,<br /> source_species_code, ns.label, contributor, PMID<br /> FROM gene_set gset<br /> INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id<br /> INNER JOIN namespace ns ON ns.id = primary_namespace_id<br /> LEFT JOIN publication pub ON publication_id = pub.id<br /> INNER JOIN gene_set_gene_symbol gsgs ON gsgs.gene_set_id = gset.id<br /> WHERE collection_name LIKE &quot;C5:GO:%&quot;<br /> GROUP BY standard_name ORDER BY standard_name limit 3;<br /> <br /> standard_name count(gene_symbol_id) collection_name source_species_code label contributor PMID<br /> GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS 6 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> GOBP_2FE_2S_CLUSTER_ASSEMBLY 11 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS 17 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets overlapping with a list of genes using Jaccard Similarity&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query computes the Jaccard Similarity of a list of HUGO gene symbols, held one per line in a text file named members.txt, across all of the gene sets in MSigDB (here is an [https://data.broadinstitute.org/gsea-msigdb/msigdb/example/members.txt example file]). 
Use MGI symbols if working with the Mouse database:<br /> &lt;pre&gt;<br /> .import members.txt member_list<br /> .mode tabs<br /> .headers on<br /> WITH QuerySet(member) AS (SELECT symbol FROM member_list)<br /> SELECT standard_name, sum(InQuerySet) AS IntersectionCount,<br /> (sum(NotInQuerySet) + (SELECT count(member) FROM QuerySet)) AS UnionCount,<br /> CAST(sum(InQuerySet) AS REAL)/(sum(NotInQuerySet) +<br /> (SELECT count(member) FROM QuerySet)) AS JaccSim<br /> FROM ( SELECT standard_name,<br /> CASE WHEN symbol IN ( SELECT member FROM QuerySet ) <br /> THEN 1 ELSE 0 END AS InQuerySet,<br /> CASE WHEN symbol NOT IN ( SELECT member FROM QuerySet ) <br /> THEN 1 ELSE 0 END AS NotInQuerySet<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsgs.gene_symbol_id = gsym.id )<br /> GROUP BY standard_name ORDER BY JaccSim DESC LIMIT 20;<br /> <br /> standard_name IntersectionCount UnionCount JaccSim<br /> SOGA_COLORECTAL_CANCER_MYC_UP 79 170 0.464705882352941<br /> WP_PYRIMIDINE_METABOLISM 24 227 0.105726872246696<br /> KEGG_PURINE_METABOLISM 31 295 0.105084745762712<br /> KEGG_PYRIMIDINE_METABOLISM 24 241 0.0995850622406639<br /> GOBP_NUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 18 191 0.0942408376963351<br /> GOBP_RIBONUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 16 185 0.0864864864864865<br /> GOBP_NUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 19 225 0.0844444444444444<br /> REACTOME_METABOLISM_OF_NUCLEOTIDES 20 244 0.0819672131147541<br /> GOBP_RIBONUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 16 211 0.0758293838862559<br /> REACTOME_NUCLEOTIDE_BIOSYNTHESIS 11 170 0.0647058823529412<br /> GOBP_PURINE_NUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 11 178 0.0617977528089888<br /> MODULE_219 11 183 0.0601092896174863<br /> SCHUHMACHER_MYC_TARGETS_UP 14 233 0.0600858369098712<br /> GOBP_PURINE_NUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 11 201 0.054726368159204<br /> 
GSE33292_WT_VS_TCF1_KO_DN3_THYMOCYTE_DN 19 348 0.0545977011494253<br /> GOBP_NUCLEOSIDE_PHOSPHATE_BIOSYNTHETIC_PROCESS 24 440 0.0545454545454545<br /> GOBP_GMP_BIOSYNTHETIC_PROCESS 9 172 0.0523255813953488<br /> GOBP_RIBOSE_PHOSPHATE_BIOSYNTHETIC_PROCESS 20 385 0.051948051948052<br /> MODULE_102 9 177 0.0508474576271186<br /> GOBP_NUCLEOBASE_BIOSYNTHETIC_PROCESS 9 177 0.0508474576271186<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;About SQLite&lt;/h2&gt;<br /> &lt;p&gt;<br /> The official SQLite documentation is available at [https://www.sqlite.org https://www.sqlite.org] and an (unofficial) introductory tutorial is available at [https://www.sqlitetutorial.net https://www.sqlitetutorial.net]. <br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> As a single-file database format, SQLite is well suited to our needs.<br /> &lt;ul&gt;<br /> &lt;li&gt;It's self-contained (https://www.sqlite.org/about.html)<br /> &lt;ul&gt;<br /> &lt;li&gt;It's not a networked client-server DB like MySQL, PostgreSQL, etc. so there is no additional set-up, administration, or maintenance in running the database.&lt;/li&gt;<br /> &lt;li&gt;A database is held in a single file, matching the idea of a portable database analogous to our existing XML format.&lt;/li&gt;<br /> &lt;li&gt;The “engine” is a small program (~1.1 MB) which reads local files.&lt;/li&gt;<br /> &lt;li&gt;Aside from initial installation, it’s ready to use directly.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;It has a full-featured SQL implementation (https://www.sqlite.org/fullsql.html)<br /> &lt;ul&gt;&lt;li&gt;A relational model gives a better representation of MSigDB contents than XML can.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;It's very fast, especially compared to processing XML. 
The developers say it's &quot;faster than the filesystem&quot; (https://www.sqlite.org/fasterthanfs.html).&lt;/li&gt;<br /> &lt;li&gt;It’s free and Open Source (Public Domain).&lt;/li&gt;<br /> &lt;li&gt;It’s ubiquitous and widely used (https://www.sqlite.org/mostdeployed.html).&lt;/li&gt;<br /> &lt;li&gt;There are programming language bindings for Python, R, Java (JDBC), Julia, C, etc.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4549 MSigDB SQLite Database 2023-03-24T02:50:42Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Mm) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. 
In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. 
SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an ''id'' primary key column, these primary key values are generated synthetically and '''will not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. 
The ''namespace_id'' will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. We provide a set of external CHIP files encoding the same information, which will usually be more convenient to use.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used to extract gene sets into custom gene set files for use with GSEA and other analysis tools, as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. 
It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Here are some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the ''primary_namespace_id'', ''secondary_namespace_id'', and their count in ''num_namespaces''. For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. 
For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;The ''publication'' and ''author'' tables associate publication info to gene sets (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. 
Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The &quot;external item&quot; (gray) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;When mining external resources for gene sets, e.g., Reactome, Gene Ontology, Human Phenotype Ontology, we sometimes find that the resulting collection would contain multiple gene sets that are too similar if we include them all. We apply a redundancy filtering procedure and select a single representative of similar candidate gene sets and exclude the others. MSigDB’s online gene set page of a selected gene set includes information about any related candidate gene sets that were excluded, linking out to details on the external resource’s website. The gray tables ''external_term'' and ''external_term_filtered_by_similarity'' contain this information. &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;Example Queries&lt;/h2&gt;<br /> &lt;p&gt;<br /> The examples given here assume we are working with the MSigDB Human database from our [https://www.gsea-msigdb.org/gsea/downloads.jsp Downloads] page (msigdb_v2023.1.Hs.db is the current version at the time of this writing). Note that we ZIP the database to reduce its size, so you must decompress it first before use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> These examples also assume the use of the [https://sqlite.org/cli.html official SQLite command line shell] to keep everything consistent across all platforms. 
The exact results may vary depending on the version of the database you are using and the particular query.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets in the GMT format&lt;/h3&gt;<br /> &lt;p&gt;<br /> One key use-case for performing SQL queries against the database involves building custom collections of gene sets, so such queries have been designed to be fast and convenient. For example, the following will select all the WikiPathways sets in the Human database into a GMT file named wikipathways.gmt:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways.gmt<br /> SELECT standard_name, 'na', group_concat(symbol, ' ')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The basic template for creating GMTs is as follows:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once &lt;filename&gt;<br /> SELECT standard_name, 'na', group_concat(symbol, ' ')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE &lt;selection criteria&gt;<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Simply vary the criteria in the WHERE clause to determine the contents of the output GMT. The first two lines are directives specific to the SQLite command line shell (fill in the desired file name on line 2). 
Note that the second argument to the ''group_concat'' function is a quoted tab character, so that the gene symbols are tab-separated as the GMT format requires.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets containing one or more specified genes&lt;/h3&gt;<br /> &lt;p&gt;<br /> Here's another simple example that finds the names of all gene sets which have BRCA1 or BRCA2 as a member:<br /> &lt;pre&gt;<br /> SELECT distinct(standard_name)<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') ORDER BY standard_name;<br /> <br /> AAAYWAACM_HFH4_01<br /> ACTAYRNNNCCCR_UNKNOWN<br /> ACTGAAA_MIR30A3P_MIR30E3P<br /> ARID3B_TARGET_GENES<br /> ASH1L_TARGET_GENES<br /> &lt;...etc...&gt;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets and their metadata&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query gets all the WikiPathways sets after applying a size threshold of between 15 and 500 genes. Here we are also providing a full link to the gene set on the GSEA-MSigDB website in place of the ‘na’ of the earlier example:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways_threshold.gmt<br /> SELECT standard_name,<br /> ( SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs' )<br /> ||'/'||standard_name,<br /> group_concat(symbol, ' ')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that here we are using a subquery to get the MSigDB_base_URL to build the website link:<br /> &lt;pre&gt;<br /> SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs'<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This next query builds on 
our earlier example combined with the above to get all sets with either BRCA1 or BRCA2 as a member in that size range and save them to a GMT:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once BRCA1_BRCA2_sets.gmt<br /> SELECT standard_name,<br /> (SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs')<br /> ||'/'||standard_name,<br /> group_concat(symbol, ' ')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE gset.id IN<br /> ( SELECT distinct(gene_set_id)<br /> FROM gene_set_gene_symbol gsgs2<br /> INNER JOIN gene_symbol gsym2 ON gsym2.id = gsgs2.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') )<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This query gets some more detailed information about a particular named gene set, including the PubMed ID:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .headers on<br /> SELECT collection_name, license_code, PMID AS PubMedID, description_brief<br /> FROM gene_set gset<br /> INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id<br /> INNER JOIN publication pub ON pub.id = publication_id<br /> WHERE standard_name = 'ZHOU_CELL_CYCLE_GENES_IN_IR_RESPONSE_6HR';<br /> <br /> collection_name license_code PubMedID description_brief<br /> C2:CGP CC-BY-4.0 17404513 Cell cycle genes significantly (p =&lt; 0.05) changed in fibroblast cells at 6 h after exposure to ionizing radiation.<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Now, get the Title and Authors for the PubMed ID from the above:<br /> &lt;pre&gt;<br /> SELECT title, group_concat(display_name) AS Authors<br /> FROM publication pub<br /> INNER JOIN publication_author pa ON publication_id = pub.id<br /> INNER JOIN author au ON author_id = au.id<br /> WHERE PMID = 17404513;<br /> <br /> title Authors<br /> 
Identification of primary transcriptional regulation of cell cycle-regulated genes upon DNA damage. Zhou T,Chou J,Mullen TE,Elkon R,Zhou Y,Simpson DA,Bushel PR,Paules RS,Lobenhofer EK,Hurban P,Kaufmann WK<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This query will find the External Term(s) and Name(s) that were filtered out as similar by our redundancy check for a given GOBP gene set:<br /> &lt;pre&gt;<br /> SELECT et.term, external_name<br /> FROM external_term et<br /> INNER JOIN external_term_filtered_by_similarity etfbs ON etfbs.term = et.term<br /> INNER JOIN gene_set gset ON gset.id = etfbs.gene_set_id<br /> WHERE standard_name = 'GOBP_MITOTIC_SPINDLE_ELONGATION';<br /> <br /> term external_name<br /> GO:0051256 mitotic spindle midzone assembly<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting a summary of gene sets&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query will extract a summary of selected gene sets with a short example WHERE clause to restrict it to the C5:GO collection only. 
You can add a more detailed WHERE clause and the column selection can be expanded or reduced as desired:<br /> &lt;pre&gt;<br /> SELECT standard_name, count(gene_symbol_id), collection_name,<br /> source_species_code, ns.label, contributor, PMID<br /> FROM gene_set gset<br /> INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id<br /> INNER JOIN namespace ns ON ns.id = primary_namespace_id<br /> LEFT JOIN publication pub ON publication_id = pub.id<br /> INNER JOIN gene_set_gene_symbol gsgs ON gsgs.gene_set_id = gset.id<br /> WHERE collection_name LIKE 'C5:GO:%'<br /> GROUP BY standard_name ORDER BY standard_name limit 3;<br /> <br /> standard_name count(gene_symbol_id) collection_name source_species_code label contributor PMID<br /> GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS 6 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> GOBP_2FE_2S_CLUSTER_ASSEMBLY 11 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS 17 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets overlapping with a list of genes using Jaccard Similarity&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query computes the Jaccard Similarity between a list of HUGO gene symbols, held one per line in a text file named members.txt, and each of the gene sets in MSigDB (an example file is here &lt;link&gt;). The query reads the imported symbols from a column named ''symbol'', so the file is expected to begin with a header line containing ''symbol''. 
Use MGI symbols if working with the Mouse database:<br /> &lt;pre&gt;<br /> .import members.txt member_list<br /> .mode tabs<br /> .headers on<br /> WITH QuerySet(member) AS (SELECT symbol FROM member_list)<br /> SELECT standard_name, sum(InQuerySet) AS IntersectionCount,<br /> (sum(NotInQuerySet) + (SELECT count(member) FROM QuerySet)) AS UnionCount,<br /> CAST(sum(InQuerySet) AS REAL)/(sum(NotInQuerySet) +<br /> (SELECT count(member) FROM QuerySet)) AS JaccSim<br /> FROM ( SELECT standard_name,<br /> CASE WHEN symbol IN ( SELECT member FROM QuerySet ) <br /> THEN 1 ELSE 0 END AS InQuerySet,<br /> CASE WHEN symbol NOT IN ( SELECT member FROM QuerySet ) <br /> THEN 1 ELSE 0 END AS NotInQuerySet<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsgs.gene_symbol_id = gsym.id )<br /> GROUP BY standard_name ORDER BY JaccSim DESC LIMIT 20;<br /> <br /> standard_name IntersectionCount UnionCount JaccSim<br /> SOGA_COLORECTAL_CANCER_MYC_UP 79 170 0.464705882352941<br /> WP_PYRIMIDINE_METABOLISM 24 227 0.105726872246696<br /> KEGG_PURINE_METABOLISM 31 295 0.105084745762712<br /> KEGG_PYRIMIDINE_METABOLISM 24 241 0.0995850622406639<br /> GOBP_NUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 18 191 0.0942408376963351<br /> GOBP_RIBONUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 16 185 0.0864864864864865<br /> GOBP_NUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 19 225 0.0844444444444444<br /> REACTOME_METABOLISM_OF_NUCLEOTIDES 20 244 0.0819672131147541<br /> GOBP_RIBONUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 16 211 0.0758293838862559<br /> REACTOME_NUCLEOTIDE_BIOSYNTHESIS 11 170 0.0647058823529412<br /> GOBP_PURINE_NUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 11 178 0.0617977528089888<br /> MODULE_219 11 183 0.0601092896174863<br /> SCHUHMACHER_MYC_TARGETS_UP 14 233 0.0600858369098712<br /> GOBP_PURINE_NUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 11 201 0.054726368159204<br /> 
GSE33292_WT_VS_TCF1_KO_DN3_THYMOCYTE_DN 19 348 0.0545977011494253<br /> GOBP_NUCLEOSIDE_PHOSPHATE_BIOSYNTHETIC_PROCESS 24 440 0.0545454545454545<br /> GOBP_GMP_BIOSYNTHETIC_PROCESS 9 172 0.0523255813953488<br /> GOBP_RIBOSE_PHOSPHATE_BIOSYNTHETIC_PROCESS 20 385 0.051948051948052<br /> MODULE_102 9 177 0.0508474576271186<br /> GOBP_NUCLEOBASE_BIOSYNTHETIC_PROCESS 9 177 0.0508474576271186<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;About SQLite&lt;/h2&gt;<br /> &lt;p&gt;<br /> The official SQLite documentation is available at [https://www.sqlite.org https://www.sqlite.org] and an (unofficial) introductory tutorial is available at [https://www.sqlitetutorial.net https://www.sqlitetutorial.net]. <br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> As a single-file database format, SQLite is well suited to our needs.<br /> &lt;ul&gt;<br /> &lt;li&gt;It's self-contained (https://www.sqlite.org/about.html)<br /> &lt;ul&gt;<br /> &lt;li&gt;It's not a networked client-server DB like MySQL, PostgreSQL, etc. so there is no additional set-up, administration, or maintenance in running the database.&lt;/li&gt;<br /> &lt;li&gt;A database is held in a single file, matching the idea of a portable database analogous to our existing XML format.&lt;/li&gt;<br /> &lt;li&gt;The “engine” is a small program (~1.1 MB) which reads local files.&lt;/li&gt;<br /> &lt;li&gt;Aside from initial installation, it’s ready to use directly.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;It has a full-featured SQL implementation (https://www.sqlite.org/fullsql.html)<br /> &lt;ul&gt;&lt;li&gt;A relational model gives a better representation of MSigDB contents than XML can.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;It's very fast, especially compared to processing XML. 
The developers say it's &quot;faster than the filesystem&quot; (https://www.sqlite.org/fasterthanfs.html).&lt;/li&gt;<br /> &lt;li&gt;It’s free and Open Source (Public Domain)&lt;/li&gt;<br /> &lt;li&gt;It’s ubiquitous and widely used (https://www.sqlite.org/mostdeployed.html).&lt;/li&gt;<br /> &lt;li&gt;There are programming language bindings for Python, R, Java (JDBC), Julia, C, etc.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4548 MSigDB SQLite Database 2023-03-24T02:44:11Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. 
In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. 
SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an ''id'' primary key column, these primary key values are generated synthetically and '''will not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. 
The ''namespace_id'' will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. We provide a set of external CHIP files encoding the same information, which will usually be more convenient to use.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used to extract gene sets into custom gene set files for use with GSEA and other analysis tools, as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. 
It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Here are some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the ''primary_namespace_id'', ''secondary_namespace_id'', and their count in ''num_namespaces''. For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. 
For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;The ''publication'' and ''author'' tables associate publication info to gene sets (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. 
Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The &quot;external item&quot; (gray) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;When mining external resources for gene sets, e.g., Reactome, Gene Ontology, Human Phenotype Ontology, we sometimes find that the resulting collection would contain multiple gene sets that are too similar if we include them all. We apply a redundancy filtering procedure and select a single representative of similar candidate gene sets and exclude the others. MSigDB’s online gene set page of a selected gene set includes information about any related candidate gene sets that were excluded, linking out to details on the external resource’s website. The gray tables ''external_term'' and ''external_term_filtered_by_similarity'' contain this information. &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;Example Queries&lt;/h2&gt;<br /> &lt;p&gt;<br /> The examples given here assume we are working with the MSigDB Human database from our [https://www.gsea-msigdb.org/gsea/downloads.jsp Downloads] page (msigdb_v2023.1.Hs.db is the current version at the time of this writing). Note that we ZIP the database to reduce its size, so you must decompress it first before use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> These examples also assume the use of the [https://sqlite.org/cli.html official SQLite command line shell] to keep everything consistent across all platforms. 
The exact results may vary depending on the version of the database you are using and the particular query.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets in the GMT format&lt;/h3&gt;<br /> &lt;p&gt;<br /> One key use-case for performing SQL queries against the database involves building custom collections of gene sets, so those have been designed to be fast and convenient. For example, the following will select all the WikiPathways sets in the Human database into a GMT file named wikipathways.gmt:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways.gmt<br /> SELECT standard_name 'na', group_concat(symbol, ' ')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The basic template for creating GMTs is as follows:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once &lt;filename&gt;<br /> SELECT standard_name 'na', group_concat(symbol, ' ')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE &lt;selection criteria&gt;<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Simply vary the criteria in the WHERE clause to determine the contents of the output GMT. The first two lines are SQLite specific directives (fill in the desired file name on line 2). 
Note that the second argument to the ''group_concat'' function must be a tab character rather than a space: type a literal tab between the quotes, or substitute the equivalent expression char(9).<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets containing one or more specified genes&lt;/h3&gt;<br /> &lt;p&gt;<br /> Here's another simple example that finds the names of all gene sets that have BRCA1 or BRCA2 as a member:<br /> &lt;pre&gt;<br /> SELECT distinct(standard_name)<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') ORDER BY standard_name;<br /> <br /> AAAYWAACM_HFH4_01<br /> ACTAYRNNNCCCR_UNKNOWN<br /> ACTGAAA_MIR30A3P_MIR30E3P<br /> ARID3B_TARGET_GENES<br /> ASH1L_TARGET_GENES<br /> &lt;...etc...&gt;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets and their metadata&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query gets all the WikiPathways sets after applying a size threshold of between 15 and 500 genes. Here we are also providing a full link to the gene set on the GSEA-MSigDB website in place of the ‘na’ of the earlier example:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways_threshold.gmt<br /> SELECT standard_name,<br /> ( SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs' )<br /> ||'/'||standard_name,<br /> group_concat(symbol, char(9))<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that here we are using a subquery to get the MSigDB_base_URL to build the website link:<br /> &lt;pre&gt;<br /> SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs'<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This next query builds on 
our earlier example combined with the above to get all sets with either BRCA1 or BRCA2 as a member in that size range and save them to a GMT:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once BRCA1_BRCA2_sets.gmt<br /> SELECT standard_name,<br /> (SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs')<br /> ||'/'||standard_name,<br /> group_concat(symbol, char(9))<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE gset.id IN<br /> ( SELECT distinct(gene_set_id)<br /> FROM gene_set_gene_symbol gsgs2<br /> INNER JOIN gene_symbol gsym2 ON gsym2.id = gsgs2.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') )<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This query gets some more detailed information about a particular named gene set, including the PubMed ID:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .headers on<br /> SELECT collection_name, license_code, PMID AS PubMedID, description_brief<br /> FROM gene_set gset<br /> INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id<br /> INNER JOIN publication pub ON pub.id = publication_id<br /> WHERE standard_name = 'ZHOU_CELL_CYCLE_GENES_IN_IR_RESPONSE_6HR';<br /> <br /> collection_name license_code PubMedID description_brief<br /> C2:CGP CC-BY-4.0 17404513 Cell cycle genes significantly (p =&lt; 0.05) changed in fibroblast cells at 6 h after exposure to ionizing radiation.<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Now, get the Title and Authors for the PubMed ID from the above:<br /> &lt;pre&gt;<br /> SELECT title, group_concat(display_name) AS Authors<br /> FROM publication pub<br /> INNER JOIN publication_author pa ON publication_id = pub.id<br /> INNER JOIN author au ON author_id = au.id<br /> WHERE PMID = 17404513;<br /> <br /> title Authors<br /> 
Identification of primary transcriptional regulation of cell cycle-regulated genes upon DNA damage. Zhou T,Chou J,Mullen TE,Elkon R,Zhou Y,Simpson DA,Bushel PR,Paules RS,Lobenhofer EK,Hurban P,Kaufmann WK<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This query will find the External Term(s) and Name(s) that were filtered out as similar by our redundancy check for a given GOBP gene set:<br /> &lt;pre&gt;<br /> SELECT et.term, external_name<br /> FROM external_term et<br /> INNER JOIN external_term_filtered_by_similarity etfbs ON etfbs.term = et.term<br /> INNER JOIN gene_set gset ON gset.id = etfbs.gene_set_id<br /> WHERE standard_name = 'GOBP_MITOTIC_SPINDLE_ELONGATION';<br /> <br /> term external_name<br /> GO:0051256 mitotic spindle midzone assembly<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting a summary of gene sets&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query will extract a summary of selected gene sets with a short example WHERE clause to restrict it to the C5:GO collection only. 
You can add a more detailed WHERE clause and the column selection can be expanded or reduced as desired:<br /> &lt;pre&gt;<br /> SELECT standard_name, count(gene_symbol_id), collection_name,<br /> source_species_code, ns.label, contributor, PMID<br /> FROM gene_set gset<br /> INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id<br /> INNER JOIN namespace ns ON ns.id = primary_namespace_id<br /> LEFT JOIN publication pub ON publication_id = pub.id<br /> INNER JOIN gene_set_gene_symbol gsgs ON gsgs.gene_set_id = gset.id<br /> WHERE collection_name LIKE 'C5:GO:%'<br /> GROUP BY standard_name ORDER BY standard_name limit 3;<br /> <br /> standard_name count(gene_symbol_id) collection_name source_species_code label contributor PMID<br /> GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS 6 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> GOBP_2FE_2S_CLUSTER_ASSEMBLY 11 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS 17 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets overlapping with a list of genes using Jaccard Similarity&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query computes the Jaccard Similarity between a list of HUGO gene symbols, held one per line in a text file named members.txt, and each of the gene sets in MSigDB (an example file is here &lt;link&gt;). The query reads a column named ''symbol'' from the imported table, so the first line of members.txt should be the header ''symbol'' when the import creates the table. 
Use MGI symbols if working with the Mouse database:<br /> &lt;pre&gt;<br /> .import members.txt member_list<br /> .mode tabs<br /> .headers on<br /> WITH QuerySet(member) AS (SELECT symbol FROM member_list)<br /> SELECT standard_name, sum(InQuerySet) AS IntersectionCount,<br /> (sum(NotInQuerySet) + (SELECT count(member) FROM QuerySet)) AS UnionCount,<br /> CAST(sum(InQuerySet) AS REAL)/(sum(NotInQuerySet) +<br /> (SELECT count(member) FROM QuerySet)) AS JaccSim<br /> FROM ( SELECT standard_name,<br /> CASE WHEN symbol IN ( SELECT member FROM QuerySet ) <br /> THEN 1 ELSE 0 END AS InQuerySet,<br /> CASE WHEN symbol NOT IN ( SELECT member FROM QuerySet ) <br /> THEN 1 ELSE 0 END AS NotInQuerySet<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsgs.gene_symbol_id = gsym.id )<br /> GROUP BY standard_name ORDER BY JaccSim DESC LIMIT 20;<br /> <br /> standard_name IntersectionCount UnionCount JaccSim<br /> SOGA_COLORECTAL_CANCER_MYC_UP 79 170 0.464705882352941<br /> WP_PYRIMIDINE_METABOLISM 24 227 0.105726872246696<br /> KEGG_PURINE_METABOLISM 31 295 0.105084745762712<br /> KEGG_PYRIMIDINE_METABOLISM 24 241 0.0995850622406639<br /> GOBP_NUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 18 191 0.0942408376963351<br /> GOBP_RIBONUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 16 185 0.0864864864864865<br /> GOBP_NUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 19 225 0.0844444444444444<br /> REACTOME_METABOLISM_OF_NUCLEOTIDES 20 244 0.0819672131147541<br /> GOBP_RIBONUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 16 211 0.0758293838862559<br /> REACTOME_NUCLEOTIDE_BIOSYNTHESIS 11 170 0.0647058823529412<br /> GOBP_PURINE_NUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 11 178 0.0617977528089888<br /> MODULE_219 11 183 0.0601092896174863<br /> SCHUHMACHER_MYC_TARGETS_UP 14 233 0.0600858369098712<br /> GOBP_PURINE_NUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 11 201 0.054726368159204<br /> 
GSE33292_WT_VS_TCF1_KO_DN3_THYMOCYTE_DN 19 348 0.0545977011494253<br /> GOBP_NUCLEOSIDE_PHOSPHATE_BIOSYNTHETIC_PROCESS 24 440 0.0545454545454545<br /> GOBP_GMP_BIOSYNTHETIC_PROCESS 9 172 0.0523255813953488<br /> GOBP_RIBOSE_PHOSPHATE_BIOSYNTHETIC_PROCESS 20 385 0.051948051948052<br /> MODULE_102 9 177 0.0508474576271186<br /> GOBP_NUCLEOBASE_BIOSYNTHETIC_PROCESS 9 177 0.0508474576271186<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;About SQLite&lt;/h2&gt;<br /> &lt;p&gt;<br /> The official SQLite documentation is available at [https://www.sqlite.org https://www.sqlite.org] and an (unofficial) introductory tutorial is available at [https://www.sqlitetutorial.net https://www.sqlitetutorial.net]. <br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> As a single-file database format, SQLite is well suited to our needs.<br /> &lt;ul&gt;<br /> &lt;li&gt;It's self-contained (https://www.sqlite.org/about.html)<br /> &lt;ul&gt;<br /> &lt;li&gt;It's not a networked client-server DB like MySQL, PostgreSQL, etc. so there is no additional set-up, administration, or maintenance in running the database.&lt;/li&gt;<br /> &lt;li&gt;A database is held in a single file, matching the idea of a portable database analogous to our existing XML format.&lt;/li&gt;<br /> &lt;li&gt;The “engine” is a small program (~1.1 MB) which reads local files.&lt;/li&gt;<br /> &lt;li&gt;Aside from initial installation, it’s ready to use directly.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;It has a full-featured SQL implementation (https://www.sqlite.org/fullsql.html)<br /> &lt;ul&gt;&lt;li&gt;A relational model gives a better representation of MSigDB contents than XML can.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;It's very fast, especially compared to processing XML. 
The developers say it's &quot;faster than the filesystem&quot; (https://www.sqlite.org/fasterthanfs.html).&lt;/li&gt;<br /> &lt;li&gt;It’s free and Open Source (Public Domain)&lt;/li&gt;<br /> &lt;li&gt;It’s ubiquitous and widely used (https://www.sqlite.org/mostdeployed.html).&lt;/li&gt;<br /> &lt;li&gt;There are programming language bindings for Python, R, Java (JDBC), Julia, C, etc.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4547 MSigDB SQLite Database 2023-03-24T02:34:13Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. 
In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. 
SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an ''id'' primary key column, these primary key values are generated synthetically and '''will not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. 
The ''namespace_id'' will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins the gene sets to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, all gene sets include the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. We provide a set of external CHIP files encoding the same information which will usually be more convenient to use, however.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used when using the database to extract gene sets for custom gene set files for use with GSEA and other analysis tools as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. 
It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Here are some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the ''primary_namespace_id'', ''secondary_namespace_id'', and their count in ''num_namespaces''. For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. 
For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;The ''publication'' and ''author'' tables associate publication info to gene sets (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. 
Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The &quot;external item&quot; (gray) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;When mining external resources for gene sets, e.g., Reactome, Gene Ontology, Human Phenotype Ontology, we sometimes find that the resulting collection would contain multiple gene sets that are too similar if we include them all. We apply a redundancy filtering procedure and select a single representative of similar candidate gene sets and exclude the others. MSigDB’s online gene set page of a selected gene set includes information about any related candidate gene sets that were excluded, linking out to details on the external resource’s website. The gray tables ''external_term'' and ''external_term_filtered_by_similarity'' contain this information. &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;Example Queries&lt;/h2&gt;<br /> &lt;p&gt;<br /> The examples given here assume we are working with the MSigDB Human database from our [https://www.gsea-msigdb.org/gsea/downloads.jsp Downloads] page (msigdb_v2023.1.Hs.db is the current version at the time of this writing). Note that we ZIP the database to reduce its size, so you must decompress it first before use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> These examples also assume the use of the [https://sqlite.org/cli.html official SQLite command line shell] to keep everything consistent across all platforms. 
The exact results may vary depending on the version of the database you are using and the particular query.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets in the GMT format&lt;/h3&gt;<br /> &lt;p&gt;<br /> One key use-case for performing SQL queries against the database involves building custom collections of gene sets, so those have been designed to be fast and convenient. For example, the following will select all the WikiPathways sets in the Human database into a GMT file named wikipathways.gmt:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways.gmt<br /> SELECT standard_name 'na', group_concat(symbol, ' ')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The basic template for creating GMTs is as follows:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once &lt;filename&gt;<br /> SELECT standard_name 'na', group_concat(symbol, ' ')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE &lt;selection criteria&gt;<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Simply vary the criteria in the WHERE clause to determine the contents of the output GMT. The first two lines are SQLite specific directives (fill in the desired file name on line 2). 
Note that the second argument to the ''group_concat'' function must be a tab character rather than a space: type a literal tab between the quotes, or substitute the equivalent expression char(9).<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets containing one or more specified genes&lt;/h3&gt;<br /> &lt;p&gt;<br /> Here's another simple example that finds the names of all gene sets that have BRCA1 or BRCA2 as a member:<br /> &lt;pre&gt;<br /> SELECT distinct(standard_name)<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') ORDER BY standard_name;<br /> <br /> AAAYWAACM_HFH4_01<br /> ACTAYRNNNCCCR_UNKNOWN<br /> ACTGAAA_MIR30A3P_MIR30E3P<br /> ARID3B_TARGET_GENES<br /> ASH1L_TARGET_GENES<br /> &lt;...etc...&gt;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets and their metadata&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query gets all the WikiPathways sets after applying a size threshold of between 15 and 500 genes. Here we are also providing a full link to the gene set on the GSEA-MSigDB website in place of the ‘na’ of the earlier example:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways_threshold.gmt<br /> SELECT standard_name,<br /> ( SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs' )<br /> ||'/'||standard_name,<br /> group_concat(symbol, char(9))<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that here we are using a subquery to get the MSigDB_base_URL to build the website link:<br /> &lt;pre&gt;<br /> SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs'<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This next query builds on 
our earlier example combined with the above to get all sets with either BRCA1 or BRCA2 as a member in that size range and save them to a GMT:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once BRCA1_BRCA2_sets.gmt<br /> SELECT standard_name,<br /> (SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs')<br /> ||'/'||standard_name,<br /> group_concat(symbol, char(9))<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE gset.id IN<br /> ( SELECT distinct(gene_set_id)<br /> FROM gene_set_gene_symbol gsgs2<br /> INNER JOIN gene_symbol gsym2 ON gsym2.id = gsgs2.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') )<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This query gets some more detailed information about a particular named gene set, including the PubMed ID:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .headers on<br /> SELECT collection_name, license_code, PMID AS PubMedID, description_brief<br /> FROM gene_set gset<br /> INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id<br /> INNER JOIN publication pub ON pub.id = publication_id<br /> WHERE standard_name = 'ZHOU_CELL_CYCLE_GENES_IN_IR_RESPONSE_6HR';<br /> <br /> collection_name license_code PubMedID description_brief<br /> C2:CGP CC-BY-4.0 17404513 Cell cycle genes significantly (p =&lt; 0.05) changed in fibroblast cells at 6 h after exposure to ionizing radiation.<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Now, get the Title and Authors for the PubMed ID from the above:<br /> &lt;pre&gt;<br /> SELECT title, group_concat(display_name) AS Authors<br /> FROM publication pub<br /> INNER JOIN publication_author pa ON publication_id = pub.id<br /> INNER JOIN author au ON author_id = au.id<br /> WHERE PMID = 17404513;<br /> <br /> title Authors<br /> 
Identification of primary transcriptional regulation of cell cycle-regulated genes upon DNA damage. Zhou T,Chou J,Mullen TE,Elkon R,Zhou Y,Simpson DA,Bushel PR,Paules RS,Lobenhofer EK,Hurban P,Kaufmann WK<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This query will find the External Term(s) and Name(s) that were filtered out as similar by our redundancy check for a given GOBP gene set:<br /> &lt;pre&gt;<br /> SELECT et.term, external_name<br /> FROM external_term et<br /> INNER JOIN external_term_filtered_by_similarity etfbs ON etfbs.term = et.term<br /> INNER JOIN gene_set gset ON gset.id = etfbs.gene_set_id<br /> WHERE standard_name = 'GOBP_MITOTIC_SPINDLE_ELONGATION';<br /> <br /> term external_name<br /> GO:0051256 mitotic spindle midzone assembly<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting a summary of gene sets&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query will extract a summary of selected gene sets with a short example WHERE clause to restrict it to the C5:GO collection only. 
You can add a more detailed WHERE clause and the column selection can be expanded or reduced as desired:<br /> &lt;pre&gt;<br /> SELECT standard_name, count(gene_symbol_id), collection_name,<br /> source_species_code, ns.label, contributor, PMID<br /> FROM gene_set gset<br /> INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id<br /> INNER JOIN namespace ns ON ns.id = primary_namespace_id<br /> LEFT JOIN publication pub ON publication_id = pub.id<br /> INNER JOIN gene_set_gene_symbol gsgs ON gsgs.gene_set_id = gset.id<br /> WHERE collection_name LIKE 'C5:GO:%'<br /> GROUP BY standard_name ORDER BY standard_name limit 3;<br /> <br /> standard_name count(gene_symbol_id) collection_name source_species_code label contributor PMID<br /> GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS 6 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> GOBP_2FE_2S_CLUSTER_ASSEMBLY 11 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS 17 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets overlapping with a list of genes using Jaccard Similarity&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query computes the Jaccard Similarity between a list of HUGO gene symbols, held one per line in a text file named members.txt, and each of the gene sets in MSigDB (an example file is here &lt;link&gt;). Because ''.import'' creates the ''member_list'' table from the file and takes the column name from its first line, the query expects that first line to be the header ''symbol''. 
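&lt;/p&gt;<br /> &lt;p&gt;<br /> For reference, the Jaccard Similarity of two sets is the size of their intersection divided by the size of their union. The short Python sketch below is not part of MSigDB and uses made-up gene lists; it only illustrates the quantity that the SQL query reports per gene set, including the identity the query relies on: the size of the union equals the number of gene set members that are not in the query list plus the size of the query list.<br /> &lt;/p&gt;<br />

```python
def jaccard(query, gene_set):
    """Return (intersection size, union size, Jaccard similarity) for two gene lists."""
    query, gene_set = set(query), set(gene_set)
    intersection = len(query & gene_set)
    # Same identity as the SQL: |union| = |gene set members not in query| + |query|
    union = len(gene_set - query) + len(query)
    similarity = intersection / union if union else 0.0
    return intersection, union, similarity


# Hypothetical symbols, chosen for illustration only
query = ["BRCA1", "BRCA2", "TP53", "ATM"]
gene_set = ["BRCA1", "TP53", "MYC", "CHEK2", "ATM"]
print(jaccard(query, gene_set))  # prints (3, 6, 0.5)
```

&lt;p&gt;<br /> Three of the four query symbols are members and the set contributes two further genes, so the union has six elements and the similarity is 0.5.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br />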
Use MGI symbols if working with the Mouse database:<br /> &lt;pre&gt;<br /> .import members.txt member_list<br /> .mode tabs<br /> .headers on<br /> WITH QuerySet(member) AS (SELECT symbol FROM member_list)<br /> SELECT standard_name, sum(InQuerySet) AS IntersectionCount,<br /> (sum(NotInQuerySet) + (SELECT count(member) FROM QuerySet)) AS UnionCount,<br /> CAST(sum(InQuerySet) AS REAL)/(sum(NotInQuerySet) +<br /> (SELECT count(member) FROM QuerySet)) AS JaccSim<br /> FROM ( SELECT standard_name,<br /> CASE WHEN symbol IN ( SELECT member FROM QuerySet ) <br /> THEN 1 ELSE 0 END AS InQuerySet,<br /> CASE WHEN symbol NOT IN ( SELECT member FROM QuerySet ) <br /> THEN 1 ELSE 0 END AS NotInQuerySet<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsgs.gene_symbol_id = gsym.id )<br /> GROUP BY standard_name ORDER BY JaccSim DESC LIMIT 20;<br /> <br /> standard_name IntersectionCount UnionCount JaccSim<br /> SOGA_COLORECTAL_CANCER_MYC_UP 79 170 0.464705882352941<br /> WP_PYRIMIDINE_METABOLISM 24 227 0.105726872246696<br /> KEGG_PURINE_METABOLISM 31 295 0.105084745762712<br /> KEGG_PYRIMIDINE_METABOLISM 24 241 0.0995850622406639<br /> GOBP_NUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 18 191 0.0942408376963351<br /> GOBP_RIBONUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 16 185 0.0864864864864865<br /> GOBP_NUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 19 225 0.0844444444444444<br /> REACTOME_METABOLISM_OF_NUCLEOTIDES 20 244 0.0819672131147541<br /> GOBP_RIBONUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 16 211 0.0758293838862559<br /> REACTOME_NUCLEOTIDE_BIOSYNTHESIS 11 170 0.0647058823529412<br /> GOBP_PURINE_NUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS 11 178 0.0617977528089888<br /> MODULE_219 11 183 0.0601092896174863<br /> SCHUHMACHER_MYC_TARGETS_UP 14 233 0.0600858369098712<br /> GOBP_PURINE_NUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 11 201 0.054726368159204<br /> 
GSE33292_WT_VS_TCF1_KO_DN3_THYMOCYTE_DN 19 348 0.0545977011494253<br /> GOBP_NUCLEOSIDE_PHOSPHATE_BIOSYNTHETIC_PROCESS 24 440 0.0545454545454545<br /> GOBP_GMP_BIOSYNTHETIC_PROCESS 9 172 0.0523255813953488<br /> GOBP_RIBOSE_PHOSPHATE_BIOSYNTHETIC_PROCESS 20 385 0.051948051948052<br /> MODULE_102 9 177 0.0508474576271186<br /> GOBP_NUCLEOBASE_BIOSYNTHETIC_PROCESS 9 177 0.0508474576271186<br /> &lt;/pre&gt;<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4546 MSigDB SQLite Database 2023-03-24T02:28:52Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. 
In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. 
SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an ''id'' primary key column, these primary key values are generated synthetically and '''should not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. 
The ''namespace_id'' will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. We provide a set of external CHIP files encoding the same information which will usually be more convenient to use, however.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used when using the database to extract gene sets for custom gene set files for use with GSEA and other analysis tools as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. 
It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Here are some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the ''primary_namespace_id'', ''secondary_namespace_id'', and their count in ''num_namespaces''. For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. 
For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;The ''publication'' and ''author'' tables associate publication info to gene sets (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. 
Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set(s).&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The &quot;external item&quot; (gray) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;When mining external resources for gene sets, e.g., Reactome, Gene Ontology, Human Phenotype Ontology, we sometimes find that the resulting collection would contain multiple gene sets that are too similar if we include them all. We apply a redundancy filtering procedure and select a single representative of similar candidate gene sets and exclude the others. MSigDB’s online gene set page of a selected gene set includes information about any related candidate gene sets that were excluded, linking out to details on the external resource’s website. The gray tables ''external_term'' and ''external_term_filtered_by_similarity'' contain this information. &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;Example Queries&lt;/h2&gt;<br /> &lt;p&gt;<br /> The examples given here assume we are working with the MSigDB Human database from our [https://www.gsea-msigdb.org/gsea/downloads.jsp Downloads] page (msigdb_v2023.1.Hs.db is the current version at the time of this writing). Note that we ZIP the database to reduce its size, so you must decompress it before use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> These examples also assume the use of the [https://sqlite.org/cli.html official SQLite command line shell] to keep everything consistent across all platforms. 
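&lt;/p&gt;<br /> &lt;p&gt;<br /> The shell is not required, since any SQLite client can run the same SQL. As a rough sketch only, the Python snippet below uses the standard ''sqlite3'' module against a tiny in-memory stand-in for the database. The three tables copy just the table and column names described above; the rows, the ''TOY_SET_*'' names and the helper functions ''build_toy_db'' and ''sets_containing'' are invented for illustration and are not part of MSigDB.<br /> &lt;/p&gt;<br />

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a minimal, made-up subset of the schema."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE gene_set (id INTEGER PRIMARY KEY, standard_name TEXT,
                               collection_name TEXT);
        CREATE TABLE gene_symbol (id INTEGER PRIMARY KEY, symbol TEXT);
        CREATE TABLE gene_set_gene_symbol (gene_set_id INTEGER, gene_symbol_id INTEGER);
        INSERT INTO gene_set VALUES (1, 'TOY_SET_A', 'C2:CP:WIKIPATHWAYS'),
                                    (2, 'TOY_SET_B', 'C5:GO:BP');
        INSERT INTO gene_symbol VALUES (1, 'BRCA1'), (2, 'BRCA2'), (3, 'TP53');
        INSERT INTO gene_set_gene_symbol VALUES (1, 1), (1, 3), (2, 2), (2, 3);
    """)
    return con


def sets_containing(con, symbols):
    """Return sorted names of gene sets that have any of the given symbols as a member."""
    placeholders = ",".join("?" * len(symbols))  # one bound parameter per symbol
    sql = f"""
        SELECT DISTINCT standard_name
        FROM gene_set gset
        INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id
        INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id
        WHERE symbol IN ({placeholders}) ORDER BY standard_name
    """
    return [row[0] for row in con.execute(sql, symbols)]


con = build_toy_db()
print(sets_containing(con, ["BRCA1"]))           # prints ['TOY_SET_A']
print(sets_containing(con, ["BRCA1", "BRCA2"]))  # prints ['TOY_SET_A', 'TOY_SET_B']
```

&lt;p&gt;<br /> To try this against the downloaded file instead of the toy data, one option is to replace ''build_toy_db()'' with ''sqlite3.connect(&quot;file:msigdb_v2023.1.Hs.db?mode=ro&quot;, uri=True)'', which opens the file read-only; the path assumes the decompressed database sits in the current directory.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br />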
The exact results may vary depending on the version of the database you are using and the particular query.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets in the GMT format&lt;/h3&gt;<br /> &lt;p&gt;<br /> One key use-case for performing SQL queries against the database involves building custom collections of gene sets, so those have been designed to be fast and convenient. For example, the following will select all the WikiPathways sets in the Human database into a GMT file named wikipathways.gmt:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways.gmt<br /> SELECT standard_name, 'na', group_concat(symbol, '	')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The basic template for creating GMTs is as follows:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once &lt;filename&gt;<br /> SELECT standard_name, 'na', group_concat(symbol, '	')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE &lt;selection criteria&gt;<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Simply vary the criteria in the WHERE clause to determine the contents of the output GMT. The first two lines are SQLite specific directives (fill in the desired file name on line 2). 
Note that the second argument to the ''group_concat'' function is a quoted tab character.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets containing one or more specified genes&lt;/h3&gt;<br /> &lt;p&gt;<br /> Here's another simple example that finds the names of all gene sets which have BRCA1 or BRCA2 as a member:<br /> &lt;pre&gt;<br /> SELECT distinct(standard_name)<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') ORDER BY standard_name;<br /> <br /> AAAYWAACM_HFH4_01<br /> ACTAYRNNNCCCR_UNKNOWN<br /> ACTGAAA_MIR30A3P_MIR30E3P<br /> ARID3B_TARGET_GENES<br /> ASH1L_TARGET_GENES<br /> &lt;...etc...&gt;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets and their metadata&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query gets all the WikiPathways sets after applying a size threshold of between 15 and 500 genes. Here we are also providing a full link to the gene set on the GSEA-MSigDB website in place of the ‘na’ of the earlier example:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways_threshold.gmt<br /> SELECT standard_name,<br /> ( SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs' )<br /> ||'/'||standard_name,<br /> group_concat(symbol, '	')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that here we are using a subquery to get the MSigDB_base_URL to build the website link:<br /> &lt;pre&gt;<br /> SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs'<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This next query builds on 
our earlier example combined with the above to get all sets with either BRCA1 or BRCA2 as a member in that size range and save them to a GMT:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once BRCA1_BRCA2_sets.gmt<br /> SELECT standard_name,<br /> (SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs')<br /> ||'/'||standard_name,<br /> group_concat(symbol, '	')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE gset.id IN<br /> ( SELECT distinct(gene_set_id)<br /> FROM gene_set_gene_symbol gsgs2<br /> INNER JOIN gene_symbol gsym2 ON gsym2.id = gsgs2.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') )<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This query gets some more detailed information about a particular named gene set, including the PubMed ID:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .headers on<br /> SELECT collection_name, license_code, PMID AS PubMedID, description_brief<br /> FROM gene_set gset<br /> INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id<br /> INNER JOIN publication pub ON pub.id = publication_id<br /> WHERE standard_name = 'ZHOU_CELL_CYCLE_GENES_IN_IR_RESPONSE_6HR';<br /> <br /> collection_name license_code PubMedID description_brief<br /> C2:CGP CC-BY-4.0 17404513 Cell cycle genes significantly (p =&lt; 0.05) changed in fibroblast cells at 6 h after exposure to ionizing radiation.<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Now, get the Title and Authors for the PubMed ID from the above:<br /> &lt;pre&gt;<br /> SELECT title, group_concat(display_name) AS Authors<br /> FROM publication pub<br /> INNER JOIN publication_author pa ON publication_id = pub.id<br /> INNER JOIN author au ON author_id = au.id<br /> WHERE PMID = 17404513;<br /> <br /> title Authors<br /> 
Identification of primary transcriptional regulation of cell cycle-regulated genes upon DNA damage. Zhou T,Chou J,Mullen TE,Elkon R,Zhou Y,Simpson DA,Bushel PR,Paules RS,Lobenhofer EK,Hurban P,Kaufmann WK<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This query will find the External Term(s) and Name(s) that were filtered out as similar by our redundancy check for a given GOBP gene set:<br /> &lt;pre&gt;<br /> SELECT et.term, external_name<br /> FROM external_term et<br /> INNER JOIN external_term_filtered_by_similarity etfbs ON etfbs.term = et.term<br /> INNER JOIN gene_set gset ON gset.id = etfbs.gene_set_id<br /> WHERE standard_name = 'GOBP_MITOTIC_SPINDLE_ELONGATION';<br /> <br /> term external_name<br /> GO:0051256 mitotic spindle midzone assembly<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting a summary of gene sets&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query will extract a summary of selected gene sets with a short example WHERE clause to restrict it to the C5:GO collection only. 
You can add a more detailed WHERE clause and the column selection can be expanded or reduced as desired:<br /> &lt;pre&gt;<br /> SELECT standard_name, count(gene_symbol_id), collection_name,<br /> source_species_code, ns.label, contributor, PMID<br /> FROM gene_set gset<br /> INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id<br /> INNER JOIN namespace ns ON ns.id = primary_namespace_id<br /> LEFT JOIN publication pub ON publication_id = pub.id<br /> INNER JOIN gene_set_gene_symbol gsgs ON gsgs.gene_set_id = gset.id<br /> WHERE collection_name LIKE 'C5:GO:%'<br /> GROUP BY standard_name ORDER BY standard_name limit 3;<br /> <br /> standard_name count(gene_symbol_id) collection_name source_species_code label contributor PMID<br /> GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS 6 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> GOBP_2FE_2S_CLUSTER_ASSEMBLY 11 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS 17 C5:GO:BP HS Human_NCBI_Gene_ID Gene Ontology <br /> &lt;/pre&gt;<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4545 MSigDB SQLite Database 2023-03-24T02:21:18Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. 
No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. 
It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an ''id'' primary key column, these primary key values are generated synthetically and '''should not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. 
Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. The ''namespace_id'' will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. We provide a set of external CHIP files encoding the same information which will usually be more convenient to use, however.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used when using the database to extract gene sets for custom gene set files for use with GSEA and other analysis tools as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. 
These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Here are some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the ''primary_namespace_id'', ''secondary_namespace_id'', and their count in ''num_namespaces''. 
For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. 
Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;The ''publication'' and ''author'' tables associate publication info to gene sets (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The &quot;external item&quot; (gray) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;When mining external resources for gene sets, e.g., Reactome, Gene Ontology, Human Phenotype Ontology, we sometimes find that the resulting collection would contain multiple gene sets that are too similar if we include them all. We apply a redundancy filtering procedure and select a single representative of similar candidate gene sets and exclude the others. MSigDB’s online gene set page of a selected gene set includes information about any related candidate gene sets that were excluded, linking out to details on the external resource’s website. The gray tables ''external_term'' and ''external_term_filtered_by_similarity'' contain this information. 
&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;Example Queries&lt;/h2&gt;<br /> &lt;p&gt;<br /> The examples given here assume we are working with the MSigDB Human database from our [https://www.gsea-msigdb.org/gsea/downloads.jsp Downloads] page (msigdb_v2023.1.Hs.db is the current version at the time of this writing). Note that we ZIP the database to reduce its size, so you must decompress it first before use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> These examples also assume the use of the [https://sqlite.org/cli.html official SQLite command line shell] to keep everything consistent across all platforms. The exact results may vary depending on the version of the database you are using and the particular query.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets in the GMT format&lt;/h3&gt;<br /> &lt;p&gt;<br /> One key use-case for performing SQL queries against the database involves building custom collections of gene sets, so those have been designed to be fast and convenient. 
For example, the following will select all the WikiPathways sets in the Human database into a GMT file named wikipathways.gmt:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways.gmt<br /> SELECT standard_name, 'na', group_concat(symbol, '	')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The basic template for creating GMTs is as follows:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once &lt;filename&gt;<br /> SELECT standard_name, 'na', group_concat(symbol, '	')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE &lt;selection criteria&gt;<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Simply vary the criteria in the WHERE clause to determine the contents of the output GMT. The first two lines are directives specific to the SQLite shell (fill in the desired file name on line 2). The literal 'na' fills the description field that the GMT format expects as its second column. 
Note that the second argument to the ''group_concat'' function is a quoted tab character, not a space; ''char(9)'' can be used in its place.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets containing one or more specified genes&lt;/h3&gt;<br /> &lt;p&gt;<br /> Here's another simple example that finds the names of all gene sets which have BRCA1 or BRCA2 as a member:<br /> &lt;pre&gt;<br /> SELECT distinct(standard_name)<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') ORDER BY standard_name;<br /> <br /> AAAYWAACM_HFH4_01<br /> ACTAYRNNNCCCR_UNKNOWN<br /> ACTGAAA_MIR30A3P_MIR30E3P<br /> ARID3B_TARGET_GENES<br /> ASH1L_TARGET_GENES<br /> &lt;...etc...&gt;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets and their metadata&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query gets all the WikiPathways sets after applying a size threshold of 15 to 500 genes. Here we are also providing a full link to the gene set on the GSEA-MSigDB website in place of the ‘na’ of the earlier example:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways_threshold.gmt<br /> SELECT standard_name,<br /> ( SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs' )<br /> ||'/'||standard_name,<br /> group_concat(symbol, '	')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that here we are using a subquery to get the MSigDB_base_URL to build the website link:<br /> &lt;pre&gt;<br /> SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs'<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This next query builds on 
our earlier BRCA1/BRCA2 example, combined with the size filter above, to get all sets of 15 to 500 genes that have either BRCA1 or BRCA2 as a member and save them to a GMT file:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once BRCA1_BRCA2_sets.gmt<br /> SELECT standard_name,<br /> (SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs')<br /> ||'/'||standard_name,<br /> group_concat(symbol, '	')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE gset.id IN<br /> ( SELECT distinct(gene_set_id)<br /> FROM gene_set_gene_symbol gsgs2<br /> INNER JOIN gene_symbol gsym2 ON gsym2.id = gsgs2.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') )<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4544 MSigDB SQLite Database 2023-03-24T02:14:56Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. 
Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. 
Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in every table with an ''id'' primary key column, the primary key values are generated synthetically and '''should not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. 
It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. The ''namespace_id'' will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. However, we also provide a set of external CHIP files encoding the same information, which will usually be more convenient to use.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used to extract gene sets into custom gene set files for use with GSEA and other analysis tools, as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. 
These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Here are some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the ''primary_namespace_id'', ''secondary_namespace_id'', and their count in ''num_namespaces''. 
For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. 
Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;The ''publication'' and ''author'' tables associate publication info to gene sets (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The &quot;external item&quot; (gray) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;When mining external resources for gene sets, e.g., Reactome, Gene Ontology, Human Phenotype Ontology, we sometimes find that the resulting collection would contain multiple gene sets that are too similar if we include them all. We apply a redundancy filtering procedure and select a single representative of similar candidate gene sets and exclude the others. MSigDB’s online gene set page of a selected gene set includes information about any related candidate gene sets that were excluded, linking out to details on the external resource’s website. The gray tables ''external_term'' and ''external_term_filtered_by_similarity'' contain this information. 
&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;Example Queries&lt;/h2&gt;<br /> &lt;p&gt;<br /> The examples given here assume we are working with the MSigDB Human database from our [https://www.gsea-msigdb.org/gsea/downloads.jsp Downloads] page (msigdb_v2023.1.Hs.db is the current version at the time of this writing). Note that we ZIP the database to reduce its size, so you must decompress it first before use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> These examples also assume the use of the [https://sqlite.org/cli.html official SQLite command line shell] to keep everything consistent across all platforms. The exact results may vary depending on the version of the database you are using and the particular query.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets in the GMT format&lt;/h3&gt;<br /> &lt;p&gt;<br /> One key use-case for performing SQL queries against the database involves building custom collections of gene sets, so those have been designed to be fast and convenient. 
For example, the following will select all the WikiPathways sets in the Human database into a GMT file named wikipathways.gmt:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways.gmt<br /> SELECT standard_name, 'na', group_concat(symbol, '	')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The basic template for creating GMTs is as follows:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once &lt;filename&gt;<br /> SELECT standard_name, 'na', group_concat(symbol, '	')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE &lt;selection criteria&gt;<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Simply vary the criteria in the WHERE clause to determine the contents of the output GMT. The first two lines are directives specific to the SQLite shell (fill in the desired file name on line 2). The literal 'na' fills the description field that the GMT format expects as its second column. 
Note that the second argument to the ''group_concat'' function is a quoted tab character, not a space; ''char(9)'' can be used in its place.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets containing one or more specified genes&lt;/h3&gt;<br /> &lt;p&gt;<br /> Here's another simple example that finds the names of all gene sets which have BRCA1 or BRCA2 as a member:<br /> &lt;pre&gt;<br /> SELECT distinct(standard_name)<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id<br /> WHERE symbol in ('BRCA1', 'BRCA2') ORDER BY standard_name;<br /> <br /> AAAYWAACM_HFH4_01<br /> ACTAYRNNNCCCR_UNKNOWN<br /> ACTGAAA_MIR30A3P_MIR30E3P<br /> ARID3B_TARGET_GENES<br /> ASH1L_TARGET_GENES<br /> &lt;...etc...&gt;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets and their metadata&lt;/h3&gt;<br /> &lt;p&gt;<br /> This query gets all the WikiPathways sets after applying a size threshold of 15 to 500 genes. Here we are also providing a full link to the gene set on the GSEA-MSigDB website in place of the ‘na’ of the earlier example:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways_threshold.gmt<br /> SELECT standard_name,<br /> ( SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs' )<br /> ||'/'||standard_name,<br /> group_concat(symbol, '	')<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id<br /> INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500<br /> ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that here we are using a subquery to get the MSigDB_base_URL to build the website link:<br /> &lt;pre&gt;<br /> SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs'<br /> &lt;/pre&gt;<br /> &lt;/p&gt;</div> Eby 
https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4543 MSigDB SQLite Database 2023-03-24T02:06:00Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. 
General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. 
The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in every table with an ''id'' primary key column, the primary key values are generated synthetically and '''should not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. 
The ''namespace_id'' will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. However, we also provide a set of external CHIP files encoding the same information, which will usually be more convenient to use.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used to extract gene sets into custom gene set files for use with GSEA and other analysis tools, as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. 
It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Here are some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the ''primary_namespace_id'', ''secondary_namespace_id'', and their count in ''num_namespaces''. For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. 
For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;The ''publication'' and ''author'' tables associate publication info to gene sets (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. 
Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The &quot;external item&quot; (gray) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;When mining external resources for gene sets, e.g., Reactome, Gene Ontology, Human Phenotype Ontology, we sometimes find that the resulting collection would contain multiple gene sets that are too similar if we include them all. We apply a redundancy filtering procedure and select a single representative of similar candidate gene sets and exclude the others. MSigDB’s online gene set page of a selected gene set includes information about any related candidate gene sets that were excluded, linking out to details on the external resource’s website. The gray tables ''external_term'' and ''external_term_filtered_by_similarity'' contain this information. &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;Example Queries&lt;/h2&gt;<br /> &lt;p&gt;<br /> The examples given here assume we are working with the MSigDB Human database from our [https://www.gsea-msigdb.org/gsea/downloads.jsp Downloads] page (msigdb_v2023.1.Hs.db is the current version at the time of this writing). Note that we ZIP the database to reduce its size, so you must decompress it first before use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> These examples also assume the use of the [https://sqlite.org/cli.html official SQLite command line shell] to keep everything consistent across all platforms. 
The exact results may vary depending on the version of the database you are using and the particular query.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Extracting gene sets in the GMT format&lt;/h3&gt;<br /> &lt;p&gt;<br /> One key use-case for performing SQL queries against the database is building custom collections of gene sets, so such queries have been designed to be fast and convenient. For example, the following will select all the WikiPathways sets in the Human database into a GMT file named wikipathways.gmt:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once wikipathways.gmt<br /> SELECT standard_name, 'na', group_concat(symbol, char(9))<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id<br /> WHERE collection_name = 'C2:CP:WIKIPATHWAYS'<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The basic template for creating GMTs is as follows:<br /> &lt;pre&gt;<br /> .mode tabs<br /> .once &lt;filename&gt;<br /> SELECT standard_name, 'na', group_concat(symbol, char(9))<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id<br /> WHERE &lt;selection criteria&gt;<br /> GROUP BY standard_name ORDER BY standard_name ASC;<br /> &lt;/pre&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Simply vary the criteria in the WHERE clause to determine the contents of the output GMT. The first two lines are directives specific to the SQLite command line shell (fill in the desired file name on line 2). The three selected columns correspond to the GMT fields: the gene set name, a description (here the constant 'na'), and the member genes. 
Note that the second argument to the ''group_concat'' function, char(9), evaluates to a tab character, so the member genes are tab-separated as the GMT format requires.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Finding gene sets containing one or more specified genes&lt;/h3&gt;<br /> &lt;p&gt;<br /> Here's another simple example that finds the names of all gene sets which have BRCA1 or BRCA2 as a member:<br /> &lt;pre&gt;<br /> SELECT DISTINCT standard_name<br /> FROM gene_set gset<br /> INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id<br /> INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id<br /> WHERE symbol IN ('BRCA1', 'BRCA2') ORDER BY standard_name;<br /> <br /> AAAYWAACM_HFH4_01<br /> ACTAYRNNNCCCR_UNKNOWN<br /> ACTGAAA_MIR30A3P_MIR30E3P<br /> ARID3B_TARGET_GENES<br /> ASH1L_TARGET_GENES<br /> &lt;...etc...&gt;<br /> &lt;/pre&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4542 MSigDB SQLite Database 2023-03-24T01:58:46Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. 
Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. 
Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an ''id'' primary key column, these primary key values are generated synthetically and '''will not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. 
It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. The ''namespace_id'' is constant within a given database, as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. We also provide a set of external CHIP files encoding the same information, which will usually be more convenient to use.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used to extract gene sets for custom gene set files for use with GSEA and other analysis tools, as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. 
These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Here are some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the ''primary_namespace_id'', ''secondary_namespace_id'', and their count in ''num_namespaces''. 
For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. 
Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;The ''publication'' and ''author'' tables associate publication info to gene sets (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The &quot;external item&quot; (gray) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;When mining external resources for gene sets, e.g., Reactome, Gene Ontology, Human Phenotype Ontology, we sometimes find that the resulting collection would contain multiple gene sets that are too similar if we include them all. We apply a redundancy filtering procedure and select a single representative of similar candidate gene sets and exclude the others. MSigDB’s online gene set page of a selected gene set includes information about any related candidate gene sets that were excluded, linking out to details on the external resource’s website. The gray tables ''external_term'' and ''external_term_filtered_by_similarity'' contain this information. 
&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;h2&gt;Example Queries&lt;/h2&gt;<br /> &lt;p&gt;<br /> The examples given here assume we are working with the MSigDB Human database from our [https://www.gsea-msigdb.org/gsea/downloads.jsp Downloads] page (msigdb_v2023.1.Hs.db is the current version at the time of this writing). Note that we ZIP the database to reduce its size, so you must decompress it first before use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> These examples also assume the use of the [https://sqlite.org/cli.html official SQLite command line shell] to keep everything consistent across all platforms. The exact results may vary depending on the version of the database you are using and the particular query.<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4541 MSigDB SQLite Database 2023-03-24T01:54:55Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. 
Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. 
Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an ''id'' primary key column, these primary key values are generated synthetically and '''will not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. 
It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. The ''namespace_id'' is constant within a given database, as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. We also provide a set of external CHIP files encoding the same information, which will usually be more convenient to use.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used to extract gene sets for custom gene set files for use with GSEA and other analysis tools, as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. 
These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Here are some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the ''primary_namespace_id'', ''secondary_namespace_id'', and their count in ''num_namespaces''. 
For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. 
Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;The ''publication'' and ''author'' tables associate publication info to gene sets (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;br/&gt;<br /> &lt;p&gt;<br /> The &quot;external item&quot; (gray) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;When mining external resources for gene sets, e.g., Reactome, Gene Ontology, Human Phenotype Ontology, we sometimes find that the resulting collection would contain multiple gene sets that are too similar if we include them all. We apply a redundancy filtering procedure and select a single representative of similar candidate gene sets and exclude the others. MSigDB’s online gene set page of a selected gene set includes information about any related candidate gene sets that were excluded, linking out to details on the external resource’s website. The gray tables ''external_term'' and ''external_term_filtered_by_similarity'' contain this information. 
&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4540 MSigDB SQLite Database 2023-03-24T01:52:42Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. 
General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. 
The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all tables with an ''id'' primary key column, the primary key values are generated synthetically and '''should not''' be considered stable across different versions of MSigDB (likewise when they are used as foreign keys). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. 
The ''namespace_id'' will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. We also provide a set of external CHIP files encoding the same information, which will usually be more convenient to use.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used to extract gene sets into custom gene set files for use with GSEA and other analysis tools, as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. 
It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Here are some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the ''primary_namespace_id'', ''secondary_namespace_id'', and their count in ''num_namespaces''. For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. 
For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;The ''publication'' and ''author'' tables associate publication info to gene sets (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. 
Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4539 MSigDB SQLite Database 2023-03-24T01:49:02Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. 
In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. 
SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all tables with an ''id'' primary key column, the primary key values are generated synthetically and '''should not''' be considered stable across different versions of MSigDB (likewise when they are used as foreign keys). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. 
The ''namespace_id'' will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. We also provide a set of external CHIP files encoding the same information, which will usually be more convenient to use.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used to extract gene sets into custom gene set files for use with GSEA and other analysis tools, as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. 
It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the ''primary_namespace_id'', ''secondary_namespace_id'', and their count in ''num_namespaces''. For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. 
For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;We associate publication and author info to gene sets through the correspondingly-named tables (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. 
Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4538 MSigDB SQLite Database 2023-03-24T01:44:18Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. 
In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The [ http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] for MSigDB are available on our website.<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. 
SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all tables with an ''id'' primary key column, the primary key values are generated synthetically and '''should not''' be considered stable across different versions of MSigDB (likewise when they are used as foreign keys). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. 
The ''namespace_id'' will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. We also provide a set of external CHIP files encoding the same information, which will usually be more convenient to use.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used to extract gene sets into custom gene set files for use with GSEA and other analysis tools, as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set_details'' table gives a variety of additional details for each gene set. 
It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the ''source_species_code'' column.&lt;/li&gt;<br /> &lt;li&gt;The ''added_in_MSigDB_id'', ''changed_in_MSigDB_id'', and ''changed_reason'' columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;li&gt;The ''external_details_URL'' column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the ''primary_namespace_id'', ''secondary_namespace_id'', and their count in ''num_namespaces''. For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The ''exact_source'' column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''collection'' table holds the information for each MSigDB Collection. 
For convenience, the ''collection_name'' column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_license'' table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our [ http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp License Terms] page for more info.&lt;/li&gt;<br /> &lt;li&gt;The ''MSigDB'' table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the ''added_in_MSigDB_id'' and ''changed_in_MSigDB_id'' columns of the ''gene_set_details'' table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The ''namespace'' and ''species'' tables allow us to label ''source_member'' and ''gene_symbol'' records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;We associate publication and author info to gene sets through the correspondingly-named tables (joined by ''publication_author''). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. 
Be sure to reference the '''publication itself''' for the most accurate authorship info.&lt;br/&gt;<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4537 MSigDB SQLite Database 2023-03-24T01:35:36Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. 
In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The license terms for MSigDB are available here: http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. 
SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an ''id'' primary key column, these primary key values are generated synthetically and '''should not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the ''id'' of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_set'' table holds the core information about each gene set. Note that the ''collection_name'' and ''license_code'' columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The ''tags'' column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The ''gene_symbol'' table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. 
The ''namespace_id'' will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The ''gene_set_gene_symbol'' table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The ''source_member'' table contains these original gene set member identifiers (joined via ''gene_set_source_member'').<br /> &lt;ul&gt;<br /> &lt;li&gt;The ''gene_symbol_id'' column gives the mapping to our uniformly mapped gene symbols. We provide a set of external CHIP files encoding the same information which will usually be more convenient to use, however.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used when extracting gene sets from the database to build custom gene set files for GSEA and other analysis tools, as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The metadata (purple) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The gene_set_details table gives a variety of additional details for each gene set. 
It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.&lt;br/&gt;<br /> Some columns of note:<br /> &lt;ul&gt;<br /> &lt;li&gt;While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the source_species_code column.&lt;/li&gt;<br /> &lt;li&gt;The added_in_MSigDB_id, changed_in_MSigDB_id, and changed_reason columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.&lt;/li&gt;<br /> &lt;li&gt;The external_details_URL column may actually contain multiple URLs. These will be separated by the pipe character ('|').&lt;/li&gt;<br /> &lt;li&gt;While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are recorded in the primary_namespace_id and secondary_namespace_id columns, with their count in num_namespaces. For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.&lt;/li&gt;<br /> &lt;li&gt;The exact_source column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free text listing, e.g., a figure, section, or supplementary document from a publication.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The collection table holds the information for each MSigDB Collection. For convenience, the collection_name column encodes the full collection hierarchy information, in the form &quot;C5:GO:BP&quot; or &quot;M2:CP:REACTOME&quot; for example. 
There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.&lt;/li&gt;<br /> &lt;li&gt;The gene_set_license table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our License Terms page for more info.&lt;/li&gt;<br /> &lt;li&gt;The MSigDB table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release.<br /> While these older records are not currently referenced, they are included to cover the future intent to add revision history in the added_in_MSigDB_id and changed_in_MSigDB_id columns of the gene_set_details table as mentioned earlier.&lt;/li&gt;<br /> &lt;li&gt;The namespace and species tables allow us to label source_member and gene_symbol records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.&lt;/li&gt;<br /> &lt;li&gt;We associate publication and author info to gene sets through the correspondingly-named tables (joined by publication_author). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. Be sure to reference the publication itself for the most accurate authorship info.<br /> There are a few cases of gene sets with author info but without an associated publication in PubMed. 
These are represented through &quot;placeholder&quot; publication records with titles like &quot;Placeholder publication for M2872,M2873&quot;, where the identifiers at the end are the systematic_name(s) of the corresponding gene set.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4536 MSigDB SQLite Database 2023-03-24T01:24:25Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. 
General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The license terms for MSigDB are available here: http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. 
The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an id primary key column, these primary key values are generated synthetically and '''will not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the id of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The core (blue) tables:<br /> &lt;ul&gt;<br /> &lt;li&gt;The gene_set table holds the core information about each gene set. Note that the collection_name and license_code columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.<br /> &lt;ul&gt;&lt;li&gt;The tags column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.&lt;/li&gt;&lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;li&gt;The gene_symbol table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. 
The namespace_id will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.&lt;/li&gt;<br /> &lt;li&gt;The gene_set_gene_symbol table joins each gene set to its member gene symbols.&lt;/li&gt;<br /> &lt;li&gt;In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The source_member table contains these original gene set member identifiers (joined via gene_set_source_member).<br /> &lt;ul&gt;<br /> &lt;li&gt;The gene_symbol_id column gives the mapping to our uniformly mapped gene symbols. We provide a set of external CHIP files encoding the same information which will usually be more convenient to use, however.&lt;/li&gt;<br /> &lt;li&gt;These tables '''should not''' be used when extracting gene sets from the database to build custom gene set files for GSEA and other analysis tools, as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. 
These tables are meant for informational purposes only.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4535 MSigDB SQLite Database 2023-03-24T01:17:40Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. 
General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The license terms for MSigDB are available here: http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. 
The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png|900px]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an id primary key column, these primary key values are generated synthetically and '''will not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the id of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4534 MSigDB SQLite Database 2023-03-24T01:15:54Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. 
Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The license terms for MSigDB are available here: http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. 
Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> [[File:Msigdb_release.png]]<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an id primary key column, these primary key values are generated synthetically and '''will not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the id of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. 
While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=File:Msigdb_release.png&diff=4533 File:Msigdb release.png 2023-03-24T01:14:04Z <p>Eby: </p> <hr /> <div></div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4532 MSigDB SQLite Database 2023-03-24T01:12:13Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. 
In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The license terms for MSigDB are available here: http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. 
SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that in all cases of tables with an id primary key column, these primary key values are generated synthetically and '''will not''' be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the id of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.<br /> &lt;/p&gt;<br /> &lt;p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4531 MSigDB SQLite Database 2023-03-24T01:10:08Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> &lt;p&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. 
This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> The license terms for MSigDB are available here: http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp<br /> &lt;/p&gt;<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> &lt;p&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> &lt;/p&gt;<br /> &lt;p&gt;<br /> This schema is designed to be a read-only resource. 
After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.<br /> &lt;/p&gt;<br /> &lt;h3&gt;Schema&lt;/h3&gt;<br /> &lt;p&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.<br /> &lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4530 MSigDB SQLite Database 2023-03-24T01:06:55Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. 
Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> <br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> <br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> <br /> The license terms for MSigDB are available here: http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> <br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> <br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. 
SQLite has no issues around multiple concurrent readers.<br /> <br /> &lt;h2&gt;Schema&lt;/h2&gt;<br /> Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_SQLite_Database&diff=4529 MSigDB SQLite Database 2023-03-24T01:06:01Z <p>Eby: Created page with '[http://www.broadinstitute.org/gsea/ GSEA Home] | [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures…'</p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> &lt;h2&gt;Introduction&lt;/h2&gt;<br /> With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. 
In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.<br /> <br /> Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.<br /> <br /> Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.<br /> <br /> The license terms for MSigDB are available here: http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp<br /> <br /> &lt;h2&gt;Database Design&lt;/h2&gt;<br /> &lt;h3&gt;Design Considerations&lt;/h3&gt;<br /> The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.<br /> <br /> Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.<br /> <br /> This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. 
SQLite has no issues around multiple concurrent readers.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.3.x_Release_Notes&diff=4515 GSEA v4.3.x Release Notes 2022-10-03T22:17:25Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.3.2 (Oct 2022)&lt;/h2&gt;<br /> GSEA v4.3.2 is a bug-fix release to fix another issue with the species-consistency check when working with local files loaded from the user's computer.<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.3.1 (Sep 2022)&lt;/h2&gt;<br /> GSEA v4.3.1 is a bug-fix release to fix an issue with the species-consistency check when working with local files loaded from the user's computer.<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.3.0 (Sep 2022)&lt;/h2&gt;<br /> GSEA v4.3.0 is an update to support the new Mouse MSigDB database and our adjusted versioning scheme. See the [[MSigDB_v2022.1.Mm_Release_Notes|Mouse 2022.1.Mm release notes]] and <br /> [[MSigDB_v2022.1.Hs_Release_Notes|Human 2022.1.Hs release notes]] for more details about the changes to MSigDB.<br /> <br /> Users will notice a change in the tabs of both the Gene Set chooser and the CHIP chooser available from the Run GSEA and Run Preranked screens. These allow selecting either Human or Mouse oriented files from the MSigDB servers, to assist users in both choosing the correct files for their analysis and to ensure they are not improperly mixing these files (e.g. using a Mouse CHIP with a Human GMT, etc). 
Such mixing of files meant for different species will result in an error when it is detected.<br /> <br /> These file choosers will also give a new warning when they detect the mixing of files from different versions of MSigDB, even when the species matches.<br /> <br /> We have made the decision for now that GSEA will only make such checks on files coming directly from the MSigDB servers within that session. That is, GSEA will not make these checks on any files loaded from the user's computer through the Load Data screen since we can't guarantee the contents or naming of these files.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.3.x_Release_Notes&diff=4514 GSEA v4.3.x Release Notes 2022-09-19T21:01:52Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.3.1 (Sep 2022)&lt;/h2&gt;<br /> GSEA v4.3.1 is a bug-fix release to fix an issue with the species-consistency check when working with local files loaded from the user's computer.<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.3.0 (Sep 2022)&lt;/h2&gt;<br /> GSEA v4.3.0 is an update to support the new Mouse MSigDB database and our adjusted versioning scheme. See the [[MSigDB_v2022.1.Mm_Release_Notes|Mouse 2022.1.Mm release notes]] and <br /> [[MSigDB_v2022.1.Hs_Release_Notes|Human 2022.1.Hs release notes]] for more details about the changes to MSigDB.<br /> <br /> Users will notice a change in the tabs of both the Gene Set chooser and the CHIP chooser available from the Run GSEA and Run Preranked screens. 
These allow selecting either Human or Mouse oriented files from the MSigDB servers, to assist users in both choosing the correct files for their analysis and to ensure they are not improperly mixing these files (e.g. using a Mouse CHIP with a Human GMT, etc). Such mixing of files meant for different species will result in an error when it is detected.<br /> <br /> These file choosers will also give a new warning when they detect the mixing of files from different versions of MSigDB, even when the species matches.<br /> <br /> We have made the decision for now that GSEA will only make such checks on files coming directly from the MSigDB servers within that session. That is, GSEA will not make these checks on any files loaded from the user's computer through the Load Data screen since we can't guarantee the contents or naming of these files.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.3.x_Release_Notes&diff=4513 GSEA v4.3.x Release Notes 2022-09-19T20:59:41Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.3.1 (Sep 2022)&lt;/h2&gt;<br /> GSEA v4.3.1 is a bug-fix release to fix an issue with the species-consistency check when working with local files loaded from the user's computer.<br /> <br /> &lt;h2&gt; GSEA Desktop v4.3.0 (Sep 2022)&lt;/h2&gt;<br /> GSEA v4.3.0 is an update to support the new Mouse MSigDB database and our adjusted versioning scheme. 
See the [[MSigDB_v2022.1.Mm_Release_Notes|Mouse 2022.1.Mm release notes]] and <br /> [[MSigDB_v2022.1.Hs_Release_Notes|Human 2022.1.Hs release notes]] for more details about the changes to MSigDB.<br /> <br /> Users will notice a change in the tabs of both the Gene Set chooser and the CHIP chooser available from the Run GSEA and Run Preranked screens. These allow selecting either Human or Mouse oriented files from the MSigDB servers, to assist users in both choosing the correct files for their analysis and to ensure they are not improperly mixing these files (e.g. using a Mouse CHIP with a Human GMT, etc). Such mixing of files meant for different species will result in an error when it is detected.<br /> <br /> These file choosers will also give a new warning when they detect the mixing of files from different versions of MSigDB, even when the species matches.<br /> <br /> We have made the decision for now that GSEA will only make such checks on files coming directly from the MSigDB servers within that session. 
That is, GSEA will not make these checks on any files loaded from the user's computer through the Load Data screen since we can't guarantee the contents or naming of these files.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_Latest_Release_Notes&diff=4498 MSigDB Latest Release Notes 2022-09-07T20:55:58Z <p>Eby: </p> <hr /> <div>[[MSigDB_v2022.1.Hs_Release_Notes]]<br /> <br /> [[MSigDB_v2022.1.Mm_Release_Notes]]</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=Release_Notes&diff=4497 Release Notes 2022-09-07T20:51:45Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;GSEA Software Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table width=&quot;700&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; height=&quot;78&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Sep 2022&lt;/td&gt;<br /> &lt;td&gt;4.3.&lt;em&gt;x&lt;/em&gt;*&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Support for the new Mouse MSigDB database and versioning scheme changes.<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.3.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Dec 2021 - Jan 2022&lt;/td&gt;<br /> 
&lt;td&gt;4.2.&lt;em&gt;x&lt;/em&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;New metric (Spearman) and new collapse mode (Absolute Max), better handling of missing values and many other fixes. Updated to latest Log4J jars to avoid concerns of vulnerabilities in earlier Log4J versions.&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.2.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jul 2020&lt;/td&gt;<br /> &lt;td&gt;4.1.&lt;em&gt;x&lt;/em&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Improved support for macOS Catalina, updated and improved Enrichment Reports, and numerous bug fixes&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.1.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Aug 2019 - Nov 2019&lt;/td&gt;<br /> &lt;td&gt;4.0.&lt;em&gt;x&lt;/em&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Updates for MSigDB 7.0, Java 11 compatibility, and better performance&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.0.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jul 2017&lt;/td&gt;<br /> &lt;td&gt;3.0&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Open source release, with numerous improvements and bug fixes&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v3.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2015 - Apr 2017&lt;/td&gt;<br /> &lt;td&gt;2.2.x&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.2.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jun 2014&lt;/td&gt;<br /> &lt;td&gt;2.1.0&lt;/td&gt;<br /> &lt;td&gt;Added Enrichment Map visualization of GSEA results&lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.1.0._Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jan 2007 - Jan 2014&lt;/td&gt;<br /> &lt;td&gt;2.0.x&lt;br /&gt;<br /> &lt;/td&gt;<br /> 
&lt;td&gt;&amp;nbsp;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.0.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2005&lt;/td&gt;<br /> &lt;td&gt;1.0&lt;/td&gt;<br /> &lt;td&gt;Initial release&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> <br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;MSigDB Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table height=&quot;83&quot; width=&quot;637&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Sep 2022&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2022.1.Hs*&lt;/td&gt;<br /> &lt;td&gt;Human MSigDB under a new versioning scheme&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v2022.1.Hs_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Sep 2022&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2022.1.Mm*&lt;/td&gt;<br /> &lt;td&gt;Initial release of Mouse MSigDB&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v2022.1.Mm_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jan 2022&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.5*&lt;/td&gt;<br /> &lt;td&gt;&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.5_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2021&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.4&lt;/td&gt;<br /> &lt;td&gt;&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.4_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2021&lt;/td&gt;<br /> 
&lt;td&gt;&amp;nbsp;7.3&lt;/td&gt;<br /> &lt;td&gt;C2:CP:WikiPathways +15; C2:CP:Reactome +15; C3:GTRD +175 (bugfix); C5:GO -88; C5:HPO +319; C7:VAX (new sub-collection); C8: +333&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.3_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Sep 2020&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.2&lt;/td&gt;<br /> &lt;td&gt;C2:CGP +60; C2:CP:WikiPathways (new sub-collection); C2:CP:Reactome +22; C3:GTRD -176; C5:GO +79; C5:HPO (new sub-collection); C8: +51 (promoted from supplementary)&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.2_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2020&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+28); C3 (+2904); C5 (+196)&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Aug 2019&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.0&lt;/td&gt;<br /> &lt;td&gt;C1 (-27); C2 (+738); C5 (+4079)&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jul 2018&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.2&lt;/td&gt;<br /> &lt;td&gt;C2 (+24)&lt;br&gt;<br /> [[Mapping_between_v6.2_and_v6.1_gene_sets|Mapping between v6.2 and v6.1 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.2_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2017&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+7)&lt;br&gt;<br /> [[Mapping_between_v6.1_and_v6.0_gene_sets|Mapping between v6.1 and v6.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Apr 2017&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.0&lt;/td&gt;<br /> &lt;td&gt;C2 (+2); C5 (-249)&lt;/td&gt;<br /> 
&lt;td&gt;[[MSigDB_v6.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2016&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.2&lt;/td&gt;<br /> &lt;td&gt;C2 (+4); C5 (+4,712)&lt;br&gt;<br /> [[Mapping_between_v5.2_and_v5.1_gene_sets|Mapping between v5.2 and v5.1 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v5.2_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jan 2016&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+1); C7 (+2,962)&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v5.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2015&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.0&lt;/td&gt;<br /> &lt;td&gt;H (+50); C2 (+3)&lt;br&gt;<br /> [[Mapping_between_v5.0_and_v4.0_gene_sets|Mapping between v5.0 and v4.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v5.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;May 2013&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;4.0&lt;/td&gt;<br /> &lt;td&gt;C2 (-128); C7 (+1,910)&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v4.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2012&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;3.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+1,578); C4 (-23); C6 (+189)&lt;br&gt;<br /> [[Mapping_between_v3.1_and_v3.0_gene_sets|Mapping between v3.1 and v3.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v3.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Sep 2010&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;3.0&lt;/td&gt;<br /> &lt;td&gt;C1 (-60); C2 (+1,380); C3 (-1); C4 (-2) &lt;br&gt;<br /> [[Msigdb_mapping_v2.5_to_v3|Mapping between v2.5 and v3.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v3.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> 
&lt;td&gt;April 2008&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.5&lt;/td&gt;<br /> &lt;td&gt; C2 (+205); C4 (+456); C5 (+1454) &lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v2.5_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Feb 2007&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.1&lt;/td&gt;<br /> &lt;td&gt;Minor updates to MSigDB v2.0 annotations &lt;/td&gt;<br /> &lt;td&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jan 2007&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.0&lt;/td&gt;<br /> &lt;td&gt;C1 (updated); C2 (+269); C3 (+214) &lt;br /&gt;<br /> [[Msigdb_mapping_v1_to_v2|Mapping between v1 and v2 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[Msigdb_may_2006_release_notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Nov 2005&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;1.1&lt;/td&gt;<br /> &lt;td&gt;C1 (updated); C2 (+350); C3 (+566); C4&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[http://www.broadinstitute.org/gsea/doc/msigdb_nov_2005_release_notes.pdf pdf]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;March 2005&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;1.0&lt;/td&gt;<br /> &lt;td&gt;Initial release &lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt; [http://www.broadinstitute.org/gsea/doc/msigdb_march_2005_release_notes.pdf pdf]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;Web site Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table height=&quot;78&quot; width=&quot;639&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> 
&lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Aug 2019&lt;/td&gt;<br /> &lt;td&gt;6.4*&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v7.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v6.4 Release Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Apr 2017&lt;/td&gt;<br /> &lt;td&gt;6.0&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v6.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2015&lt;/td&gt;<br /> &lt;td&gt;5.0&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v5.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jun 2014&lt;/td&gt;<br /> &lt;td&gt;4.05&lt;/td&gt;<br /> &lt;td&gt;Several minor updates&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Apr 2013&lt;/td&gt;<br /> &lt;td&gt;3.87&lt;/td&gt;<br /> &lt;td&gt;Several bug fixes and new functionality.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v3.87 Release Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Oct 2012&lt;/td&gt;<br /> &lt;td&gt;3.84&lt;/td&gt;<br /> &lt;td&gt;Several updates and new functionality.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jan 2011&lt;/td&gt;<br /> &lt;td&gt;3.5&lt;/td&gt;<br /> &lt;td&gt;Several bug fixes and some new functionality.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v3.4 Release Notes|wiki]]&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;hr&gt;<br /> * Current release &lt;br /&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.3.x_Release_Notes&diff=4496 GSEA v4.3.x Release Notes 2022-09-07T20:37:31Z <p>Eby: Created page with '[http://www.broadinstitute.org/gsea/ GSEA Home] | [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | 
[http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures…'</p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.3.0 (Sep 2022)&lt;/h2&gt;<br /> GSEA v4.3.0 is an update to support the new Mouse MSigDB database and our adjusted versioning scheme. See the [[MSigDB_v2022.1.Mm_Release_Notes|Mouse 2022.1.Mm release notes]] and <br /> [[MSigDB_v2022.1.Hs_Release_Notes|Human 2022.1.Hs release notes]] for more details about the changes to MSigDB.<br /> <br /> Users will notice a change in the tabs of both the Gene Set chooser and the CHIP chooser available from the Run GSEA and Run Preranked screens. These allow selecting either Human or Mouse oriented files from the MSigDB servers, to assist users in both choosing the correct files for their analysis and to ensure they are not improperly mixing these files (e.g. using a Mouse CHIP with a Human GMT, etc). Such mixing of files meant for different species will result in an error when it is detected.<br /> <br /> These file choosers will also give a new warning when they detect the mixing of files from different versions of MSigDB, even when the species matches.<br /> <br /> We have made the decision for now that GSEA will only make such checks on files coming directly from the MSigDB servers within that session. 
That is, GSEA will not make these checks on any files loaded from the user's computer through the Load Data screen since we can't guarantee the contents or naming of these files.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.2.x_Release_Notes&diff=4485 GSEA v4.2.x Release Notes 2022-03-02T03:41:21Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.3 (Mar 2022)&lt;/h2&gt;<br /> GSEA v4.2.3 is a security release, removing Log4J entirely from the code base. '''All users are encouraged to update!'''<br /> <br /> This also fixes an additional bug in the weighted_p1.5 scoring mode. If you have used this mode in the past, we recommend re-running your analysis with GSEA 4.2.3 to evaluate the possible differences. Minimum dataset size warnings have been added as well, to note that GSEA should be run with data from all expressed genes rather than a reduced subset or &quot;Top DEGs&quot; list.<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.2 (Jan 2022)&lt;/h2&gt;<br /> GSEA v4.2.2 is a security release, updating to Log4J 2.17.1. '''All users are encouraged to update!'''<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.1 (Dec 2021)&lt;/h2&gt;<br /> GSEA v4.2.1 is a security release, updating to Log4J 2.17.0. '''All users are encouraged to update!'''<br /> <br /> There is one minor bug fix to the TXT parser to fix an error when no Description column is present. 
There are no other changes.<br /> <br /> &lt;h2&gt; GSEA Desktop v4.2.0 (Dec 2021)&lt;/h2&gt;<br /> <br /> The GSEA v4.2.0 release includes a number of improvements and bug fixes, including:<br /> <br /> * Added a Spearman Correlation metric for continuous phenotypes.<br /> * Added a new Absolute Max of Probes collapse mode.<br /> * Updated to Log4J 2.16.0. Note, however, that we do not believe any version of GSEA Desktop is impacted by the vulnerability of earlier Log4J versions because it is a desktop application and does not expose any input forms to users over the web. '''If you are exposing GSEA through a website or other networked server then we recommend you update to 4.2.0 immediately.'''<br /> * Added a feature to allow saving the resulting dataset when the Collapse or Remap_Only options are set for a GSEA analysis. If the 'Create GCT files' option under Advanced Fields is set to ''true'', the dataset will be saved as a GCT in the ''edb'' sub-folder of the analysis result directory.<br /> * Modified to save the console log to a 'gsea.log' file in 'gsea_home'.<br /> <br /> There are also updates for better handling of missing values in the input datasets in the file parsers and computations. GSEA ignores missing values in general but there were certain situations where this was not the case. These happened primarily around missing tab fields and explicit NA or NaN input values, but there were also improvements to the handling of missing values overall.<br /> * Added more prominent warnings in the logs, the UI, and the reports when there are missing values in the input.<br /> * Modified the GCT, TXT, RNK, and PCL parsers to better handle these cases. NA values were formerly not treated as missing and would cause a numeric parsing error. Likewise for quoted empty values. These are now treated simply as missing values and ignored.<br /> * Fixed bugs in most metric calculations where the missing values were not ignored as intended. 
This affected all metrics except signal-to-noise (S2N, the default) and tTest.<br /> * Fixed the collapse calculations to also ignore missing values among the individual probes in the same way as the metric computations. This can affect the calculation of mean or median, for example.<br /> <br /> Likewise, there are also updates to provide warnings about explicit infinite values in the input dataset. '''Such values can cause unexpected results during computation or plotting and are not recommended'''. Infinite values in the input will, however, be handled and used as-is in the metric computations.<br /> <br /> Infinite values '''coming out of''' the metric computations will be adjusted to a small value when using the various &quot;weighted&quot; scoring modes, to avoid interfering with the rest of the enrichment results and any subsequent reporting. This has the effect of ''de-emphasizing'' that particular gene in any scoring. <br /> <br /> This adjustment has historically been applied to the &quot;weighted&quot; scoring modes but was not previously documented. For the &quot;weighted&quot; mode, the value is adjusted to 0.01. For the &quot;weighted_p1.5&quot; and &quot;weighted_p2&quot; modes it is adjusted to 0.000001. The adjustment is not applied to the Classic K-S scoring mode since the expression values are not directly used with this mode.<br /> <br /> A similar adjustment is also made to infinite values during plotting to avoid errors from the charting library being unable to render such values.<br /> <br /> Warnings are also provided for Infinite or NaN values coming out of metric computations (resulting from division-by-zero or taking the root of a negative value, for example).<br /> <br /> The vast majority of datasets should be unaffected by these changes as such values should be relatively rare. 
If you have run analyses on datasets with missing, NA, NaN, or Infinite values and are concerned about changes to the results, we recommend re-running the analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> <br /> Beyond that, there are a number of miscellaneous improvements and bug fixes. Chief among these are:<br /> * Fixed a bug in the calculation of the weighted_p1.5 scoring mode. If you have used this mode in the past, we recommend re-running your analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> * Changed the FDR q-value scale on the NES vs Significance plot. This was formerly 0-100 but has been changed to 0.0-1.0 to match the values in the report table.<br /> * Added minimum-sample warnings and errors for the continuous phenotype metrics. Fixed a bug where the minimum-sample check was not applied with gene_set permutation mode.<br /> * Added a warning about use of the FDR when only one gene set is being analyzed. Reported FDRs are not an accurate representation of the actual false discovery rate when derived from a single gene set.<br /> * Modified the launcher scripts to fix some issues with recent Java 11 releases on newer versions of macOS and to better support symlinks on Mac and Linux.<br /> * Fixed bugs with GMT caching and the gene set subset-select feature on Windows.<br /> * Fixed a bug with some UI parameter widgets handling empty values.<br /> * Fixed a bug where the analysis RPT file was not saved if there was an error.<br /> * Fixed a bug with GMT &amp; CHIP sorting for MSigDB point releases.<br /> * Fixed some issues with blank fields in the CHIP parser.<br /> * Fixed some bugs in the GCT &amp; TXT export functions.<br /> * Improved the error message for a missing phenotype selection.<br /> * Updated the CHIP Download link in the Help menu to use our new location.<br /> * Fixed a UI dialog-centering bug.<br /> * Added GSEA &amp; MSigDB citation info to the report.</div> Eby 
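The infinite-value adjustment described in the GSEA v4.2.0 notes above can be illustrated with a minimal sketch. This is not the GSEA Desktop source code: the function name, the dictionary, and the mode-name strings used as keys are hypothetical, and only the replacement values (0.01 and 0.000001) are taken from the notes. The notes do not say how the sign of an infinite score is treated, so the sketch simply substitutes the stated value.

```python
import math

# Replacement values quoted from the v4.2.0 release notes.
# The dictionary itself and its key spellings are illustrative assumptions.
INFINITE_SCORE_REPLACEMENT = {
    "weighted": 0.01,
    "weighted_p1.5": 0.000001,
    "weighted_p2": 0.000001,
}


def adjust_metric_score(score: float, scoring_mode: str) -> float:
    """Hypothetical helper: replace an infinite metric score with a small value.

    Modes without an entry in the table (for example a classic K-S mode)
    are returned unchanged, matching the statement that the adjustment is
    not applied there.
    """
    replacement = INFINITE_SCORE_REPLACEMENT.get(scoring_mode)
    if replacement is not None and math.isinf(score):
        # Assumption: the sign of the infinite score is not preserved.
        return replacement
    return score
```

Because the substituted value is small, the affected gene contributes very little to a weighted running sum, which is the de-emphasizing effect the notes describe.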
https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.2.x_Release_Notes&diff=4484 GSEA v4.2.x Release Notes 2022-03-02T02:20:18Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.3 (Mar 2022)&lt;/h2&gt;<br /> GSEA v4.2.3 is a security release, removing Log4J entirely from the code base. '''All users are encouraged to update!'''<br /> <br /> Minimum dataset size warnings have been added as well, to note that GSEA should be run with data from all expressed genes rather than a reduced subset or &quot;Top DEGs&quot; list.<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.2 (Jan 2022)&lt;/h2&gt;<br /> GSEA v4.2.2 is a security release, updating to Log4J 2.17.1. '''All users are encouraged to update!'''<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.1 (Dec 2021)&lt;/h2&gt;<br /> GSEA v4.2.1 is a security release, updating to Log4J 2.17.0. '''All users are encouraged to update!'''<br /> <br /> There is one minor bug fix to the TXT parser to fix an error when no Description column is present. There are no other changes.<br /> <br /> &lt;h2&gt; GSEA Desktop v4.2.0 (Dec 2021)&lt;/h2&gt;<br /> <br /> The GSEA v4.2.0 release includes a number of improvements and bug fixes, including:<br /> <br /> * Added a Spearman Correlation metric for continuous phenotypes.<br /> * Added a new Absolute Max of Probes collapse mode.<br /> * Updated to Log4J 2.16.0. 
Note, however, that we do not believe any version of GSEA Desktop is impacted by the vulnerability of earlier Log4j versions because it is a desktop application and does not expose any input forms to users over the web. '''If you are exposing GSEA through a website or other networked server then we recommend you update to 4.2.0 immediately.'''<br /> * Added a feature to allow saving the resulting dataset when the Collapse or Remap_Only options are set for a GSEA analysis. If the 'Create GCT files' option under Advanced Fields is set to ''true'', the dataset will be saved as a GCT in the ''edb'' sub-folder of the analysis result directory.<br /> * Modified GSEA to save the console log to a 'gsea.log' file in 'gsea_home'.<br /> <br /> There are also updates for better handling of missing values in the input datasets in the file parsers and computations. GSEA ignores missing values in general but there were certain situations where this was not the case. These happened primarily around missing tab fields and explicit NA or NaN input values, but there were also improvements to the handling of missing values overall.<br /> * Added more prominent warnings in the logs, the UI, and the reports when there are missing values in the input.<br /> * Modified the GCT, TXT, RNK, and PCL parsers to better handle these cases. NA values were formerly not treated as missing and would cause a numeric parsing error. Likewise for quoted empty values. These are now treated simply as missing values and ignored.<br /> * Fixed bugs in most metric calculations where the missing values were not ignored as intended. This affected all metrics except signal-to-noise (S2N, the default) and tTest.<br /> * Fixed the collapse calculations to also ignore missing values among the individual probes in the same way as the metric computations. 
This can affect the calculation of mean or median, for example.<br /> <br /> Likewise, there are also updates to provide warnings about explicit infinite values in the input dataset. '''Such values can cause unexpected results during computation or plotting and are not recommended'''. Infinite values in the input will, however, be handled and used as-is in the metric computations.<br /> <br /> Infinite values '''coming out of''' the metric computations will be adjusted to a small value when using the various &quot;weighted&quot; scoring modes, to avoid interfering with the rest of the enrichment results and any subsequent reporting. This has the effect of ''de-emphasizing'' that particular gene in any scoring. <br /> <br /> This adjustment has historically been applied to the &quot;weighted&quot; scoring modes but was not previously documented. For the &quot;weighted&quot; mode, the value is adjusted to 0.01. For the &quot;weighted_p1.5&quot; and &quot;weighted_p2&quot; modes it is adjusted to 0.000001. The adjustment is not applied to the Classic K-S scoring mode since the expression values are not directly used with this mode.<br /> <br /> A similar adjustment is also made to infinite values during plotting to avoid errors from the charting library being unable to render such values.<br /> <br /> Warnings are also provided for Infinite or NaN values coming out of metric computations (resulting from division-by-zero or taking the root of a negative value, for example).<br /> <br /> The vast majority of datasets should be unaffected by these changes as such values should be relatively rare. If you have run analyses on datasets with missing, NA, NaN, or Infinite values and are concerned about changes to the results, we recommend re-running the analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> <br /> Beyond that, there are a number of miscellaneous improvements and bug fixes. 
Chief among these are:<br /> * Fixed a bug in the calculation of the weighted_p1.5 scoring mode. If you have used this mode in the past, we recommend re-running your analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> * Changed the FDR q-value scale on the NES vs Significance plot. This was formerly 0-100 but has been changed to 0.0-1.0 to match the values in the report table.<br /> * Added minimum-sample warnings and errors for the continuous phenotype metrics. Fixed a bug where the minimum-sample check was not applied with gene_set permutation mode.<br /> * Added a warning about use of the FDR when only one gene set is being analyzed. Reported FDRs are not an accurate representation of the actual false discovery rate when derived from a single gene set.<br /> * Modified the launcher scripts to fix some issues with recent Java 11 releases on newer versions of macOS and to better support symlinks on Mac and Linux.<br /> * Fixed bugs with GMT caching and the gene set subset-select feature on Windows.<br /> * Fixed a bug with some UI parameter widgets handling empty values.<br /> * Fixed a bug where the analysis RPT file was not saved if there was an error.<br /> * Fixed a bug with GMT &amp; CHIP sorting for MSigDB point releases.<br /> * Fixed some issues with blank fields in the CHIP parser.<br /> * Fixed some bugs in the GCT &amp; TXT export functions.<br /> * Improved the error message for a missing phenotype selection.<br /> * Updated the CHIP Download link in the Help menu to use our new location.<br /> * Fixed a UI dialog-centering bug.<br /> * Added GSEA &amp; MSigDB citation info to the report.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.2.x_Release_Notes&diff=4483 GSEA v4.2.x Release Notes 2022-01-21T04:15:25Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> 
[http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.2 (Jan 2022)&lt;/h2&gt;<br /> GSEA v4.2.2 is a security release, updating to Log4J 2.17.1. '''All users are encouraged to update!'''<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.1 (Dec 2021)&lt;/h2&gt;<br /> GSEA v4.2.1 is a security release, updating to Log4J 2.17.0. '''All users are encouraged to update!'''<br /> <br /> There is one minor bug fix to the TXT parser to fix an error when no Description column is present. There are no other changes.<br /> <br /> &lt;h2&gt; GSEA Desktop v4.2.0 (Dec 2021)&lt;/h2&gt;<br /> <br /> The GSEA v4.2.0 release includes a number of improvements and bug fixes, including:<br /> <br /> * Added a Spearman Correlation metric for continuous phenotypes.<br /> * Added a new Absolute Max of Probes collapse mode.<br /> * Updated to Log4J 2.16.0. Note, however, that we do not believe any version of GSEA Desktop is impacted by the vulnerability of earlier Log4j versions because it is a desktop application and does not expose any input forms to users over the web. '''If you are exposing GSEA through a website or other networked server then we recommend you update to 4.2.0 immediately.'''<br /> * Added a feature to allow saving the resulting dataset when the Collapse or Remap_Only options are set for a GSEA analysis. If the 'Create GCT files' option under Advanced Fields is set to ''true'', the dataset will be saved as a GCT in the ''edb'' sub-folder of the analysis result directory.<br /> * Modified GSEA to save the console log to a 'gsea.log' file in 'gsea_home'.<br /> <br /> There are also updates for better handling of missing values in the input datasets in the file parsers and computations. 
GSEA ignores missing values in general but there were certain situations where this was not the case. These happened primarily around missing tab fields and explicit NA or NaN input values, but there were also improvements to the handling of missing values overall.<br /> * Added more prominent warnings in the logs, the UI, and the reports when there are missing values in the input.<br /> * Modified the GCT, TXT, RNK, and PCL parsers to better handle these cases. NA values were formerly not treated as missing and would cause a numeric parsing error. Likewise for quoted empty values. These are now treated simply as missing values and ignored.<br /> * Fixed bugs in most metric calculations where the missing values were not ignored as intended. This affected all metrics except signal-to-noise (S2N, the default) and tTest.<br /> * Fixed the collapse calculations to also ignore missing values among the individual probes in the same way as the metric computations. This can affect the calculation of mean or median, for example.<br /> <br /> Likewise, there are also updates to provide warnings about explicit infinite values in the input dataset. '''Such values can cause unexpected results during computation or plotting and are not recommended'''. Infinite values in the input will, however, be handled and used as-is in the metric computations.<br /> <br /> Infinite values '''coming out of''' the metric computations will be adjusted to a small value when using the various &quot;weighted&quot; scoring modes, to avoid interfering with the rest of the enrichment results and any subsequent reporting. This has the effect of ''de-emphasizing'' that particular gene in any scoring. <br /> <br /> This adjustment has historically been applied to the &quot;weighted&quot; scoring modes but was not previously documented. For the &quot;weighted&quot; mode, the value is adjusted to 0.01. For the &quot;weighted_p1.5&quot; and &quot;weighted_p2&quot; modes it is adjusted to 0.000001. 
The adjustment is not applied to the Classic K-S scoring mode since the expression values are not directly used with this mode.<br /> <br /> A similar adjustment is also made to infinite values during plotting to avoid errors from the charting library being unable to render such values.<br /> <br /> Warnings are also provided for Infinite or NaN values coming out of metric computations (resulting from division-by-zero or taking the root of a negative value, for example).<br /> <br /> The vast majority of datasets should be unaffected by these changes as such values should be relatively rare. If you have run analyses on datasets with missing, NA, NaN, or Infinite values and are concerned about changes to the results, we recommend re-running the analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> <br /> Beyond that, there are a number of miscellaneous improvements and bug fixes. Chief among these are:<br /> * Fixed a bug in the calculation of the weighted_p1.5 scoring mode. If you have used this mode in the past, we recommend re-running your analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> * Changed the FDR q-value scale on the NES vs Significance plot. This was formerly 0-100 but has been changed to 0.0-1.0 to match the values in the report table.<br /> * Added minimum-sample warnings and errors for the continuous phenotype metrics. Fixed a bug where the minimum-sample check was not applied with gene_set permutation mode.<br /> * Added a warning about use of the FDR when only one gene set is being analyzed. 
Reported FDRs are not an accurate representation of the actual false discovery rate when derived from a single gene set.<br /> * Modified the launcher scripts to fix some issues with recent Java 11 releases on newer versions of macOS and to better support symlinks on Mac and Linux.<br /> * Fixed bugs with GMT caching and the gene set subset-select feature on Windows.<br /> * Fixed a bug with some UI parameter widgets handling empty values.<br /> * Fixed a bug where the analysis RPT file was not saved if there was an error.<br /> * Fixed a bug with GMT &amp; CHIP sorting for MSigDB point releases.<br /> * Fixed some issues with blank fields in the CHIP parser.<br /> * Fixed some bugs in the GCT &amp; TXT export functions.<br /> * Improved the error message for a missing phenotype selection.<br /> * Updated the CHIP Download link in the Help menu to use our new location.<br /> * Fixed a UI dialog-centering bug.<br /> * Added GSEA &amp; MSigDB citation info to the report.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=Release_Notes&diff=4482 Release Notes 2022-01-21T02:07:00Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;GSEA Software Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table width=&quot;700&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; height=&quot;78&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> 
&lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Dec 2021 - Jan 2022&lt;/td&gt;<br /> &lt;td&gt;4.2.&lt;em&gt;x&lt;/em&gt;*&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;New metric (Spearman) and new collapse mode (Absolute Max), better handling of missing values and many other fixes. Updated to latest Log4J jars to avoid concerns of vulnerabilities in earlier Log4J versions.&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.2.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jul 2020&lt;/td&gt;<br /> &lt;td&gt;4.1.&lt;em&gt;x&lt;/em&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Improved support for macOS Catalina, updated and improved Enrichment Reports, and numerous bug fixes&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.1.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Aug 2019 - Nov 2019&lt;/td&gt;<br /> &lt;td&gt;4.0.&lt;em&gt;x&lt;/em&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Updates for MSigDB 7.0, Java 11 compatibility, and better performance&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.0.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jul 2017&lt;/td&gt;<br /> &lt;td&gt;3.0&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Open source release, with numerous improvements and bug fixes&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v3.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2015 - Apr 2017&lt;/td&gt;<br /> &lt;td&gt;2.2.x&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.2.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jun 2014&lt;/td&gt;<br /> &lt;td&gt;2.1.0&lt;/td&gt;<br /> &lt;td&gt;Added Enrichment Map visualization of GSEA 
results&lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.1.0._Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jan 2007 - Jan 2014&lt;/td&gt;<br /> &lt;td&gt;2.0.x&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.0.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2005&lt;/td&gt;<br /> &lt;td&gt;1.0&lt;/td&gt;<br /> &lt;td&gt;Initial release&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> <br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;MSigDB Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table height=&quot;83&quot; width=&quot;637&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jan 2022&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.5*&lt;/td&gt;<br /> &lt;td&gt;&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.5_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2021&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.4&lt;/td&gt;<br /> &lt;td&gt;&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.4_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2021&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.3&lt;/td&gt;<br /> &lt;td&gt;C2:CP:WikiPathways +15; C2:CP:Reactome +15; C3:GTRD +175 (bugfix); C5:GO -88; C5:HPO +319; C7:VAX (new sub-collection); C8: +333&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.3_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Sep 2020&lt;/td&gt;<br /> 
&lt;td&gt;&amp;nbsp;7.2&lt;/td&gt;<br /> &lt;td&gt;C2:CGP +60; C2:CP:WikiPathways (new sub-collection); C2:CP:Reactome +22; C3:GTRD -176; C5:GO +79, C5:HPO (new sub-collection); C8: +51 (promoted from supplementary)&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.2_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2020&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+28); C3 (+2904); C5 (+196)&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Aug 2019&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.0&lt;/td&gt;<br /> &lt;td&gt;C1 (-27); C2 (+738); C5 (+4079); &lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jul 2018&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.2&lt;/td&gt;<br /> &lt;td&gt;C2 (+24)&lt;br&gt;<br /> [[Mapping_between_v6.2_and_v6.1_gene_sets|Mapping between v6.2 and v6.1 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.2_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2017&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+7)&lt;br&gt;<br /> [[Mapping_between_v6.1_and_v6.0_gene_sets|Mapping between v6.1 and v6.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Apr 2017&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.0&lt;/td&gt;<br /> &lt;td&gt;C2 (+2); C5 (-249)&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2016&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.2&lt;/td&gt;<br /> &lt;td&gt;C2 (+4); C5 (+4,712)&lt;br&gt;<br /> [[Mapping_between_v5.2_and_v5.1_gene_sets|Mapping between v5.2 and v5.1 gene sets]]&lt;/td&gt;<br /> 
&lt;td&gt;[[MSigDB_v5.2_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jan 2016&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+1); C7 (+2,962)&lt;/td&gt;<br /> &lt;td&gt; [[MSigDB_v5.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2015&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.0&lt;/td&gt;<br /> &lt;td&gt;H (+50); C2 (+3)&lt;br&gt;<br /> [[Mapping_between_v5.0_and_v4.0_gene_sets|Mapping between v5.0 and v4.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v5.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;May 2013&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;4.0&lt;/td&gt;<br /> &lt;td&gt;C2 (-128); C7 (+1,910)&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v4.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2012&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;3.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+1,578); C4 (-23); C6 (+189)&lt;br&gt;<br /> [[Mapping_between_v3.1_and_v3.0_gene_sets|Mapping between v3.0 and v3.1 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v3.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Sept 2010&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;3.0&lt;/td&gt;<br /> &lt;td&gt;C1 (-60); C2 (+1,380); C3 (-1); C4 (-2) &lt;br&gt;<br /> [[Msigdb_mapping_v2.5_to_v3|Mapping between v2.5 and v3.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v3.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;April 2008&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.5&lt;/td&gt;<br /> &lt;td&gt; C2 (+205); C4 (+456); C5 (+1454) &lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v2.5_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Feb 2007&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.1&lt;/td&gt;<br /> 
&lt;td&gt;Minor updates to MSigDB v2.0 annotations &lt;/td&gt;<br /> &lt;td&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jan 2007&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.0&lt;/td&gt;<br /> &lt;td&gt;C1 (updated); C2 (+269); C3 (+214) &lt;br /&gt;<br /> [[Msigdb_mapping_v1_to_v2|Mapping between v1 and v2 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[Msigdb_may_2006_release_notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Nov 2005&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;1.1&lt;/td&gt;<br /> &lt;td&gt;C1 (updated); C2 (+350); C3 (+566); C4&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[http://www.broadinstitute.org/gsea/doc/msigdb_nov_2005_release_notes.pdf pdf]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;March 2005&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;1.0&lt;/td&gt;<br /> &lt;td&gt;Initial release &lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt; [http://www.broadinstitute.org/gsea/doc/msigdb_march_2005_release_notes.pdf pdf]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;Web site Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table height=&quot;78&quot; width=&quot;639&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Aug 2019&lt;/td&gt;<br /> &lt;td&gt;6.4*&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v7.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v6.4 Release Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Apr 2017&lt;/td&gt;<br /> 
&lt;td&gt;6.0&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v6.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2015&lt;/td&gt;<br /> &lt;td&gt;5.0&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v5.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jun 2014&lt;/td&gt;<br /> &lt;td&gt;4.05&lt;/td&gt;<br /> &lt;td&gt;Several minor updates&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Apr 2013&lt;/td&gt;<br /> &lt;td&gt;3.87&lt;/td&gt;<br /> &lt;td&gt;Several bug fixes and new functionality.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v3.87 Release Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Oct 2012&lt;/td&gt;<br /> &lt;td&gt;3.84&lt;/td&gt;<br /> &lt;td&gt;Several updates and new functionality.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jan 2011&lt;/td&gt;<br /> &lt;td&gt;3.5&lt;/td&gt;<br /> &lt;td&gt;Several bug fixes and some new functionality.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v3.4 Release Notes|wiki]]&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;hr&gt;<br /> * Current release &lt;br /&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_Latest_Release_Notes&diff=4476 MSigDB Latest Release Notes 2022-01-13T06:23:51Z <p>Eby: Redirected page to MSigDB v7.5 Release Notes</p> <hr /> <div>#REDIRECT [[MSigDB_v7.5_Release_Notes]]</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=Release_Notes&diff=4475 Release Notes 2022-01-13T06:23:23Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> 
[http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;GSEA Software Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table width=&quot;700&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; height=&quot;78&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Dec 2021&lt;/td&gt;<br /> &lt;td&gt;4.2.&lt;em&gt;x&lt;/em&gt;*&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;New metric (Spearman) and new collapse mode (Absolute Max), better handling of missing values and many other fixes. 
Updated to Log4J 2.16.0 to avoid concerns of vulnerabilities in earlier Log4J versions.&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.2.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jul 2020&lt;/td&gt;<br /> &lt;td&gt;4.1.&lt;em&gt;x&lt;/em&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Improved support for macOS Catalina, updated and improved Enrichment Reports, and numerous bug fixes&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.1.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Aug 2019 - Nov 2019&lt;/td&gt;<br /> &lt;td&gt;4.0.&lt;em&gt;x&lt;/em&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Updates for MSigDB 7.0, Java 11 compatibility, and better performance&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.0.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jul 2017&lt;/td&gt;<br /> &lt;td&gt;3.0&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Open source release, with numerous improvements and bug fixes&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v3.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2015 - Apr 2017&lt;/td&gt;<br /> &lt;td&gt;2.2.x&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.2.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jun 2014&lt;/td&gt;<br /> &lt;td&gt;2.1.0&lt;/td&gt;<br /> &lt;td&gt;Added Enrichment Map visualization of GSEA results&lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.1.0._Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jan 2007 - Jan 2014&lt;/td&gt;<br /> &lt;td&gt;2.0.x&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.0.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2005&lt;/td&gt;<br /> 
&lt;td&gt;1.0&lt;/td&gt;<br /> &lt;td&gt;Initial release&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> <br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;MSigDB Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table height=&quot;83&quot; width=&quot;637&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jan 2022&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.5*&lt;/td&gt;<br /> &lt;td&gt;&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.5_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2021&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.4&lt;/td&gt;<br /> &lt;td&gt;&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.4_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2021&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.3&lt;/td&gt;<br /> &lt;td&gt;C2:CP:WikiPathways +15; C2:CP:Reactome +15; C3:GTRD +175 (bugfix); C5:GO -88; C5:HPO +319; C7:VAX (new sub-collection); C8: +333&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.3_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Sep 2020&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.2&lt;/td&gt;<br /> &lt;td&gt;C2:CGP +60; C2:CP:WikiPathways (new sub-collection); C2:CP:Reactome +22; C3:GTRD -176; C5:GO +79, C5:HPO (new sub-collection); C8: +51 (promoted from supplementary)&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.2_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2020&lt;/td&gt;<br /> 
&lt;td&gt;&amp;nbsp;7.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+28); C3 (+2904); C5 (+196)&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Aug 2019&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.0&lt;/td&gt;<br /> &lt;td&gt;C1 (-27); C2 (+738); C5 (+4079)&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jul 2018&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.2&lt;/td&gt;<br /> &lt;td&gt;C2 (+24)&lt;br&gt;<br /> [[Mapping_between_v6.2_and_v6.1_gene_sets|Mapping between v6.2 and v6.1 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.2_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2017&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+7)&lt;br&gt;<br /> [[Mapping_between_v6.1_and_v6.0_gene_sets|Mapping between v6.1 and v6.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Apr 2017&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.0&lt;/td&gt;<br /> &lt;td&gt;C2 (+2); C5 (-249)&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2016&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.2&lt;/td&gt;<br /> &lt;td&gt;C2 (+4); C5 (+4,712)&lt;br&gt;<br /> [[Mapping_between_v5.2_and_v5.1_gene_sets|Mapping between v5.2 and v5.1 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v5.2_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jan 2016&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+1); C7 (+2,962)&lt;/td&gt;<br /> &lt;td&gt; [[MSigDB_v5.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2015&lt;/td&gt;<br /> 
&lt;td&gt;&amp;nbsp;5.0&lt;/td&gt;<br /> &lt;td&gt;H (+50); C2 (+3)&lt;br&gt;<br /> [[Mapping_between_v5.0_and_v4.0_gene_sets|Mapping between v5.0 and v4.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v5.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;May 2013&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;4.0&lt;/td&gt;<br /> &lt;td&gt;C2 (-128); C7 (+1,910)&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v4.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2012&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;3.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+1,578); C4 (-23); C6 (+189)&lt;br&gt;<br /> [[Mapping_between_v3.1_and_v3.0_gene_sets|Mapping between v3.0 and v3.1 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v3.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Sept 2010&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;3.0&lt;/td&gt;<br /> &lt;td&gt;C1 (-60); C2 (+1,380); C3 (-1); C4 (-2) &lt;br&gt;<br /> [[Msigdb_mapping_v2.5_to_v3|Mapping between v2.5 and v3.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v3.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;April 2008&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.5&lt;/td&gt;<br /> &lt;td&gt; C2 (+205); C4 (+456); C5 (+1454) &lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v2.5_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Feb 2007&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.1&lt;/td&gt;<br /> &lt;td&gt;Minor updates to MSigDB v2.0 annotations &lt;/td&gt;<br /> &lt;td&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jan 2007&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.0&lt;/td&gt;<br /> &lt;td&gt;C1 (updated); C2 (+269); C3 (+214) &lt;br /&gt;<br /> [[Msigdb_mapping_v1_to_v2|Mapping between v1 and v2 gene 
sets]]&lt;/td&gt;<br /> &lt;td&gt;[[Msigdb_may_2006_release_notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Nov 2005&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;1.1&lt;/td&gt;<br /> &lt;td&gt;C1 (updated); C2 (+350); C3 (+566); C4&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[http://www.broadinstitute.org/gsea/doc/msigdb_nov_2005_release_notes.pdf pdf]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;March 2005&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;1.0&lt;/td&gt;<br /> &lt;td&gt;Initial release &lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt; [http://www.broadinstitute.org/gsea/doc/msigdb_march_2005_release_notes.pdf pdf]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;Web site Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table height=&quot;78&quot; width=&quot;639&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Aug 2019&lt;/td&gt;<br /> &lt;td&gt;6.4*&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v7.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v6.4 Release Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Apr 2017&lt;/td&gt;<br /> &lt;td&gt;6.0&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v6.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2015&lt;/td&gt;<br /> &lt;td&gt;5.0&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v5.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jun 2014&lt;/td&gt;<br /> 
&lt;td&gt;4.05&lt;/td&gt;<br /> &lt;td&gt;Several minor updates&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Apr 2013&lt;/td&gt;<br /> &lt;td&gt;3.87&lt;/td&gt;<br /> &lt;td&gt;Several bug fixes and new functionality.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v3.87 Release Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Oct 2012&lt;/td&gt;<br /> &lt;td&gt;3.84&lt;/td&gt;<br /> &lt;td&gt;Several updates and new functionality.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jan 2011&lt;/td&gt;<br /> &lt;td&gt;3.5&lt;/td&gt;<br /> &lt;td&gt;Several bug fixes and some new functionality.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v3.4 Release Notes|wiki]]&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;hr&gt;<br /> * Current release &lt;br /&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.2.x_Release_Notes&diff=4466 GSEA v4.2.x Release Notes 2021-12-23T06:51:11Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.1 (Dec 2021)&lt;/h2&gt;<br /> GSEA v4.2.1 is a security release, updating to Log4J 2.17.0. '''All users are encouraged to update!'''<br /> <br /> There is one minor bug fix to the TXT parser to fix an error when no Description column is present. 
There are no other changes.<br /> <br /> &lt;h2&gt; GSEA Desktop v4.2.0 (Dec 2021)&lt;/h2&gt;<br /> <br /> The GSEA v4.2.0 release includes a number of improvements and bug fixes, including:<br /> <br /> * Added a Spearman Correlation metric for continuous phenotypes.<br /> * Added a new Absolute Max of Probes collapse mode.<br /> * Updated to Log4J 2.16.0. Note however, we do not believe any version of GSEA Desktop is impacted by the vulnerability of earlier Log4j versions because it is a desktop application and does not expose any input forms to users over the web. '''If you are exposing GSEA through a website or other networked server then we recommend you update to 4.2.0 immediately.'''<br /> * Added a feature to allow saving the resulting dataset when the Collapse or Remap_Only options are set for a GSEA analysis. If the 'Create GCT files' option under Advanced Fields is set to ''true'', the dataset will be saved as a GCT in the ''edb'' sub-folder of the analysis result directory.<br /> * Modified to save the console log to a 'gsea.log' file in 'gsea_home'.<br /> <br /> There are also updates for better handling of missing values in the input datasets in the file parsers and computations. GSEA ignores missing values in general but there were certain situations where this was not the case. These happened primarily around missing tab fields and explicit NA or NaN input values, but there were also improvements to the handling of missing values overall.<br /> * Added more prominent warnings in the logs, the UI, and the reports when there are missing values in the input.<br /> * Modified the GCT, TXT, RNK, and PCL parsers to better handle these cases. NA values were formerly not treated as missing and would cause a numeric parsing error. Likewise for quoted empty values. These are now treated simply as missing values and ignored.<br /> * Fixed bugs in most metric calculations where the missing values were not ignored as intended. 
This affected all metrics except signal-to-noise (S2N, the default) and tTest.<br /> * Fixed the collapse calculations to also ignore missing values among the individual probes in the same way as the metric computations. This can affect the calculation of mean or median, for example.<br /> <br /> Likewise, there are also updates to provide warnings about explicit infinite values in the input dataset. '''Such values can cause unexpected results during computation or plotting and are not recommended'''. Infinite values in the input will, however, be handled and used as-is in the metric computations.<br /> <br /> Infinite values '''coming out of''' the metric computations will be adjusted to a small value when using the various &quot;weighted&quot; scoring modes, to avoid interfering with the rest of the enrichment results and any subsequent reporting. This has the effect of ''de-emphasizing'' that particular gene in any scoring. <br /> <br /> This adjustment has historically been applied to the &quot;weighted&quot; scoring modes but was not previously documented. For the &quot;weighted&quot; mode, the value is adjusted to 0.01. For the &quot;weighted_p1.5&quot; and &quot;weighted_p2&quot; modes it is adjusted to 0.000001. The adjustment is not applied to the Classic K-S scoring mode since the expression values are not directly used with this mode.<br /> <br /> A similar adjustment is also made to infinite values during plotting to avoid errors from the charting library being unable to render such values.<br /> <br /> Warnings are also provided for Infinite or NaN values coming out of metric computations (resulting from division-by-zero or taking the root of a negative value, for example).<br /> <br /> The vast majority of datasets should be unaffected by these changes as such values should be relatively rare. 
If you have run analyses on datasets with missing, NA, NaN, or Infinite values and are concerned about changes to the results, we recommend re-running the analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> <br /> Beyond that, there are a number of miscellaneous improvements and bug fixes. Chief among these are:<br /> * Fixed a bug in the calculation of the weighted_p1.5 scoring mode. If you have used this mode in the past, we recommend re-running your analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> * Changed the FDR q-value scale on the NES vs Significance plot. This was formerly 0-100 but has been changed to 0.0-1.0 to match the values in the report table.<br /> * Added minimum-sample warnings and errors for the continuous phenotype metrics. Fixed a bug where the minimum-sample check was not applied with gene_set permutation mode.<br /> * Added a warning about use of the FDR when only one gene set is being analyzed. Reported FDRs are not an accurate representation of the actual false discovery rate when derived from a single gene set.<br /> * Modified the launcher scripts to fix some issues with recent Java 11 releases on newer versions of macOS and to better support symlinks on Mac and Linux.<br /> * Fixed bugs with GMT caching and the gene set subset-select feature on Windows.<br /> * Fixed a bug with some UI parameter widgets handling empty values.<br /> * Fixed a bug where the analysis RPT file was not saved if there was an error.<br /> * Fixed a bug with GMT &amp; CHIP sorting for MSigDB point releases.<br /> * Fixed some issues with blank fields in the CHIP parser.<br /> * Fixed some bugs in the GCT &amp; TXT export functions.<br /> * Improved the error message for a missing phenotype selection.<br /> * Updated the CHIP Download link in the Help menu to use our new location.<br /> * Fixed a UI dialog-centering bug.<br /> * Added GSEA &amp; MSigDB citation info to the report.</div> Eby 
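The infinite-value adjustment described in the v4.2.0 notes above can be illustrated with a minimal Python sketch. This is not GSEA's Java implementation: the function name `adjust_infinite_scores`, the lookup-table name, and the `classic` mode identifier are hypothetical. The notes do not say whether the sign of an infinite value is preserved, so the sketch simply substitutes the documented value as-is. Only the replacement values (0.01 for `weighted`, 0.000001 for `weighted_p1.5` and `weighted_p2`) are taken from the notes.

```python
import math

# Replacement values quoted in the GSEA v4.2.0 release notes.
# The table name and layout are assumptions made for this sketch.
INFINITE_SCORE_REPLACEMENT = {
    "weighted": 0.01,
    "weighted_p1.5": 0.000001,
    "weighted_p2": 0.000001,
}


def adjust_infinite_scores(scores, scoring_mode):
    """Return a copy of scores with infinite values replaced for weighted modes.

    Modes without an entry in the table (for example a hypothetical "classic"
    identifier for the Classic K-S mode) are returned unchanged.
    """
    replacement = INFINITE_SCORE_REPLACEMENT.get(scoring_mode)
    if replacement is None:
        return list(scores)
    # Assumption: the documented value is used as-is, regardless of sign.
    return [replacement if math.isinf(s) else s for s in scores]


if __name__ == "__main__":
    metric_scores = [2.5, float("inf"), -1.0, float("-inf")]
    print(adjust_infinite_scores(metric_scores, "weighted"))
    print(adjust_infinite_scores(metric_scores, "classic"))
```

Replacing an infinite score with a value close to zero gives that gene almost no weight in a weighted running sum, which is consistent with the de-emphasizing effect the notes describe. NaN values are left untouched here because the notes only mention warnings for them.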
https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=Release_Notes&diff=4465 Release Notes 2021-12-16T20:16:54Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;GSEA Software Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table width=&quot;700&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; height=&quot;78&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Dec 2021&lt;/td&gt;<br /> &lt;td&gt;4.2.&lt;em&gt;x&lt;/em&gt;*&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;New metric (Spearman) and new collapse mode (Absolute Max), better handling of missing values and many other fixes. 
Updated to Log4J 2.16.0 to avoid concerns of vulnerabilities in earlier Log4J versions.&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.2.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jul 2020&lt;/td&gt;<br /> &lt;td&gt;4.1.&lt;em&gt;x&lt;/em&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Improved support for macOS Catalina, updated and improved Enrichment Reports, and numerous bug fixes&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.1.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Aug 2019 - Nov 2019&lt;/td&gt;<br /> &lt;td&gt;4.0.&lt;em&gt;x&lt;/em&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Updates for MSigDB 7.0, Java 11 compatibility, and better performance&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.0.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jul 2017&lt;/td&gt;<br /> &lt;td&gt;3.0&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Open source release, with numerous improvements and bug fixes&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v3.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2015 - Apr 2017&lt;/td&gt;<br /> &lt;td&gt;2.2.x&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.2.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jun 2014&lt;/td&gt;<br /> &lt;td&gt;2.1.0&lt;/td&gt;<br /> &lt;td&gt;Added Enrichment Map visualization of GSEA results&lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.1.0._Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jan 2007 - Jan 2014&lt;/td&gt;<br /> &lt;td&gt;2.0.x&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.0.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2005&lt;/td&gt;<br /> 
&lt;td&gt;1.0&lt;/td&gt;<br /> &lt;td&gt;Initial release&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> <br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;MSigDB Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table height=&quot;83&quot; width=&quot;637&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2021&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.4*&lt;/td&gt;<br /> &lt;td&gt;&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.4_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2021&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.3&lt;/td&gt;<br /> &lt;td&gt;C2:CP:WikiPathways +15; C2:CP:Reactome +15; C3:GTRD +175 (bugfix); C5:GO -88; C5:HPO +319; C7:VAX (new sub-collection); C8: +333&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.3_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Sep 2020&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.2&lt;/td&gt;<br /> &lt;td&gt;C2:CGP +60; C2:CP:WikiPathways (new sub-collection); C2:CP:Reactome +22; C3:GTRD -176; C5:GO +79, C5:HPO (new sub-collection); C8: +51 (promoted from supplementary)&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.2_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2020&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+28); C3 (+2904); C5 (+196)&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Aug 2019&lt;/td&gt;<br
/> &lt;td&gt;&amp;nbsp;7.0&lt;/td&gt;<br /> &lt;td&gt;C1 (-27); C2 (+738); C5 (+4079)&lt;br&gt;&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jul 2018&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.2&lt;/td&gt;<br /> &lt;td&gt;C2 (+24)&lt;br&gt;<br /> [[Mapping_between_v6.2_and_v6.1_gene_sets|Mapping between v6.2 and v6.1 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.2_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2017&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+7)&lt;br&gt;<br /> [[Mapping_between_v6.1_and_v6.0_gene_sets|Mapping between v6.1 and v6.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Apr 2017&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.0&lt;/td&gt;<br /> &lt;td&gt;C2 (+2); C5 (-249)&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2016&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.2&lt;/td&gt;<br /> &lt;td&gt;C2 (+4); C5 (+4,712)&lt;br&gt;<br /> [[Mapping_between_v5.2_and_v5.1_gene_sets|Mapping between v5.2 and v5.1 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v5.2_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jan 2016&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+1); C7 (+2,962)&lt;/td&gt;<br /> &lt;td&gt; [[MSigDB_v5.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2015&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.0&lt;/td&gt;<br /> &lt;td&gt;H (+50); C2 (+3)&lt;br&gt;<br /> [[Mapping_between_v5.0_and_v4.0_gene_sets|Mapping between v5.0 and v4.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v5.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr 
valign=&quot;top&quot;&gt;<br /> &lt;td&gt;May 2013&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;4.0&lt;/td&gt;<br /> &lt;td&gt;C2 (-128); C7 (+1,910)&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v4.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2012&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;3.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+1,578); C4 (-23); C6 (+189)&lt;br&gt;<br /> [[Mapping_between_v3.1_and_v3.0_gene_sets|Mapping between v3.0 and v3.1 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v3.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Sept 2010&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;3.0&lt;/td&gt;<br /> &lt;td&gt;C1 (-60); C2 (+1,380); C3 (-1); C4 (-2) &lt;br&gt;<br /> [[Msigdb_mapping_v2.5_to_v3|Mapping between v2.5 and v3.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v3.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;April 2008&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.5&lt;/td&gt;<br /> &lt;td&gt; C2 (+205); C4 (+456); C5 (+1454) &lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v2.5_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Feb 2007&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.1&lt;/td&gt;<br /> &lt;td&gt;Minor updates to MSigDB v2.0 annotations &lt;/td&gt;<br /> &lt;td&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jan 2007&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.0&lt;/td&gt;<br /> &lt;td&gt;C1 (updated); C2 (+269); C3 (+214) &lt;br /&gt;<br /> [[Msigdb_mapping_v1_to_v2|Mapping between v1 and v2 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[Msigdb_may_2006_release_notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Nov 2005&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;1.1&lt;/td&gt;<br /> &lt;td&gt;C1 (updated); C2 (+350); C3 (+566); C4&lt;br 
/&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[http://www.broadinstitute.org/gsea/doc/msigdb_nov_2005_release_notes.pdf pdf]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;March 2005&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;1.0&lt;/td&gt;<br /> &lt;td&gt;Initial release &lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt; [http://www.broadinstitute.org/gsea/doc/msigdb_march_2005_release_notes.pdf pdf]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;Web site Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table height=&quot;78&quot; width=&quot;639&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Aug 2019&lt;/td&gt;<br /> &lt;td&gt;6.4*&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v7.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v6.4 Release Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Apr 2017&lt;/td&gt;<br /> &lt;td&gt;6.0&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v6.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2015&lt;/td&gt;<br /> &lt;td&gt;5.0&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v5.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jun 2014&lt;/td&gt;<br /> &lt;td&gt;4.05&lt;/td&gt;<br /> &lt;td&gt;Several minor updates&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Apr 2013&lt;/td&gt;<br /> &lt;td&gt;3.87&lt;/td&gt;<br /> &lt;td&gt;Several bug fixes and new functionality.&lt;/td&gt;<br /> 
&lt;td&gt;[[Web site v3.87 Release Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Oct 2012&lt;/td&gt;<br /> &lt;td&gt;3.84&lt;/td&gt;<br /> &lt;td&gt;Several updates and new functionality.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jan 2011&lt;/td&gt;<br /> &lt;td&gt;3.5&lt;/td&gt;<br /> &lt;td&gt;Several bug fixes and some new functionality.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v3.4 Release Notes|wiki]]&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;hr&gt;<br /> * Current release &lt;br /&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.2.x_Release_Notes&diff=4464 GSEA v4.2.x Release Notes 2021-12-16T01:38:13Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.0 (Dec 2021)&lt;/h2&gt;<br /> <br /> The GSEA v4.2.0 release includes a number of improvements and bug fixes, including:<br /> <br /> * Added a Spearman Correlation metric for continuous phenotypes.<br /> * Added a new Absolute Max of Probes collapse mode.<br /> * Updated to Log4J 2.16.0. Note however, we do not believe any version of GSEA Desktop is impacted by the vulnerability of earlier Log4j versions because it is a desktop application and does not expose any input forms to users over the web. 
'''If you are exposing GSEA through a website or other networked server then we recommend you update to 4.2.0 immediately.'''<br /> * Added a feature to allow saving the resulting dataset when the Collapse or Remap_Only options are set for a GSEA analysis. If the 'Create GCT files' option under Advanced Fields is set to ''true'', the dataset will be saved as a GCT in the ''edb'' sub-folder of the analysis result directory.<br /> * Modified to save the console log to a 'gsea.log' file in 'gsea_home'.<br /> <br /> There are also updates for better handling of missing values in the input datasets in the file parsers and computations. GSEA ignores missing values in general but there were certain situations where this was not the case. These happened primarily around missing tab fields and explicit NA or NaN input values, but there were also improvements to the handling of missing values overall.<br /> * Added more prominent warnings in the logs, the UI, and the reports when there are missing values in the input.<br /> * Modified the GCT, TXT, RNK, and PCL parsers to better handle these cases. NA values were formerly not treated as missing and would cause a numeric parsing error. Likewise for quoted empty values. These are now treated simply as missing values and ignored.<br /> * Fixed bugs in most metric calculations where the missing values were not ignored as intended. This affected all metrics except signal-to-noise (S2N, the default) and tTest.<br /> * Fixed the collapse calculations to also ignore missing values among the individual probes in the same way as the metric computations. This can affect the calculation of mean or median, for example.<br /> <br /> Likewise, there are also updates to provide warnings about explicit infinite values in the input dataset. '''Such values can cause unexpected results during computation or plotting and are not recommended'''. 
Infinite values in the input will, however, be handled and used as-is in the metric computations.<br /> <br /> Infinite values '''coming out of''' the metric computations will be adjusted to a small value when using the various &quot;weighted&quot; scoring modes, to avoid interfering with the rest of the enrichment results and any subsequent reporting. This has the effect of ''de-emphasizing'' that particular gene in any scoring. <br /> <br /> This adjustment has historically been applied to the &quot;weighted&quot; scoring modes but was not previously documented. For the &quot;weighted&quot; mode, the value is adjusted to 0.01. For the &quot;weighted_p1.5&quot; and &quot;weighted_p2&quot; modes it is adjusted to 0.000001. The adjustment is not applied to the Classic K-S scoring mode since the expression values are not directly used with this mode.<br /> <br /> A similar adjustment is also made to infinite values during plotting to avoid errors from the charting library being unable to render such values.<br /> <br /> Warnings are also provided for Infinite or NaN values coming out of metric computations (resulting from division-by-zero or taking the root of a negative value, for example).<br /> <br /> The vast majority of datasets should be unaffected by these changes as such values should be relatively rare. If you have run analyses on datasets with missing, NA, NaN, or Infinite values and are concerned about changes to the results, we recommend re-running the analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> <br /> Beyond that, there are a number of miscellaneous improvements and bug fixes. Chief among these are:<br /> * Fixed a bug in the calculation of the weighted_p1.5 scoring mode. If you have used this mode in the past, we recommend re-running your analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> * Changed the FDR q-value scale on the NES vs Significance plot. 
This was formerly 0-100 but has been changed to 0.0-1.0 to match the values in the report table.<br /> * Added minimum-sample warnings and errors for the continuous phenotype metrics. Fixed a bug where the minimum-sample check was not applied with gene_set permutation mode.<br /> * Added a warning about use of the FDR when only one gene set is being analyzed. Reported FDRs are not an accurate representation of the actual false discovery rate when derived from a single gene set.<br /> * Modified the launcher scripts to fix some issues with recent Java 11 releases on newer versions of macOS and to better support symlinks on Mac and Linux.<br /> * Fixed bugs with GMT caching and the gene set subset-select feature on Windows.<br /> * Fixed a bug with some UI parameter widgets handling empty values.<br /> * Fixed a bug where the analysis RPT file was not saved if there was an error.<br /> * Fixed a bug with GMT &amp; CHIP sorting for MSigDB point releases.<br /> * Fixed some issues with blank fields in the CHIP parser.<br /> * Fixed some bugs in the GCT &amp; TXT export functions.<br /> * Improved the error message for a missing phenotype selection.<br /> * Updated the CHIP Download link in the Help menu to use our new location.<br /> * Fixed a UI dialog-centering bug.<br /> * Added GSEA &amp; MSigDB citation info to the report.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.2.x_Release_Notes&diff=4463 GSEA v4.2.x Release Notes 2021-12-16T01:32:32Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.0 (Dec 
2021)&lt;/h2&gt;<br /> <br /> The GSEA v4.2.0 release includes a number of improvements and bug fixes, including:<br /> <br /> * Added a Spearman Correlation metric for continuous phenotypes.<br /> * Added a new Absolute Max of Probes collapse mode.<br /> * Updated to Log4J 2.16.0. Note however, we do not believe any version of GSEA Desktop is impacted by the vulnerability of earlier Log4j versions because it is a desktop application and does not expose any input forms to users over the web. '''If you are exposing GSEA through a website or other networked server then we recommend you update to 4.2.0 immediately.'''<br /> * Added a feature to allow saving the resulting dataset when the Collapse or Remap_Only options are set for a GSEA analysis. If the 'Create GCT files' option under Advanced Fields is set to ''true'', the dataset will be saved as a GCT in the ''edb'' sub-folder of the analysis result directory.<br /> * Modified to save the console log to a 'gsea.log' file in 'gsea_home'.<br /> <br /> There are also updates for better handling of missing values in the input datasets in the file parsers and computations. GSEA ignores missing values in general but there were certain situations where this was not the case. These happened primarily around missing tab fields and explicit NA or NaN input values, but there were also improvements to the handling of missing values overall.<br /> * Added more prominent warnings in the logs, the UI, and the reports when there are missing values in the input.<br /> * Modified the GCT, TXT, RNK, and PCL parsers to better handle these cases. NA values were formerly not treated as missing and would cause a numeric parsing error. Likewise for quoted empty values. These are now treated simply as missing values and ignored.<br /> * Fixed bugs in most metric calculations where the missing values were not ignored as intended. 
This affected all metrics except signal-to-noise (S2N, the default) and tTest.<br /> * Fixed the collapse calculations to also ignore missing values among the individual probes in the same way as the metric computations. This can affect the calculation of mean or median, for example.<br /> <br /> Likewise, there are also updates to provide warnings about explicit infinite values in the input dataset. '''Such values can cause unexpected results during computation or plotting and are not recommended'''. Infinite values in the input will, however, be handled and used as-is in the metric computations.<br /> <br /> Infinite values '''coming out of''' the metric computations will be adjusted to 0.01 when using the various &quot;weighted&quot; scoring modes, to avoid interfering with the rest of the enrichment results and any subsequent reporting. This has the effect of ''de-emphasizing'' that particular gene in any scoring. <br /> <br /> This adjustment has historically been applied to the &quot;weighted&quot; scoring mode but was not previously documented; it has been extended to the &quot;weighted_p1.5&quot; and &quot;weighted_p2&quot; modes. It is not applied to the Classic K-S scoring mode since the expression values are not directly used with this mode.<br /> <br /> This adjustment is also made to infinite values during plotting to avoid errors from the charting library being unable to render such values.<br /> <br /> Warnings are also provided for NaN values coming out of metric computations (resulting from division-by-zero or taking the root of a negative value).<br /> <br /> The vast majority of datasets should be unaffected by these changes as such values should be relatively rare. 
If you have run analyses on datasets with missing, NA, NaN, or Infinite values and are concerned about changes to the results, we recommend re-running the analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> <br /> Beyond that, there are a number of miscellaneous improvements and bug fixes. Chief among these are:<br /> * Fixed a bug in the calculation of the weighted_p1.5 scoring mode. If you have used this mode in the past, we recommend re-running your analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> * Changed the FDR q-value scale on the NES vs Significance plot. This was formerly 0-100 but has been changed to 0.0-1.0 to match the values in the report table.<br /> * Added minimum-sample warnings and errors for the continuous phenotype metrics. Fixed a bug where the minimum-sample check was not applied with gene_set permutation mode.<br /> * Added a warning about use of the FDR when only one gene set is being analyzed. Reported FDRs are not an accurate representation of the actual false discovery rate when derived from a single gene set.<br /> * Modified the launcher scripts to fix some issues with recent Java 11 releases on newer versions of macOS and to better support symlinks on Mac and Linux.<br /> * Fixed bugs with GMT caching and the gene set subset-select feature on Windows.<br /> * Fixed a bug with some UI parameter widgets handling empty values.<br /> * Fixed a bug where the analysis RPT file was not saved if there was an error.<br /> * Fixed a bug with GMT &amp; CHIP sorting for MSigDB point releases.<br /> * Fixed some issues with blank fields in the CHIP parser.<br /> * Fixed some bugs in the GCT &amp; TXT export functions.<br /> * Improved the error message for a missing phenotype selection.<br /> * Updated the CHIP Download link in the Help menu to use our new location.<br /> * Fixed a UI dialog-centering bug.<br /> * Added GSEA &amp; MSigDB citation info to the report.</div> Eby 
https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.2.x_Release_Notes&diff=4462 GSEA v4.2.x Release Notes 2021-12-16T01:19:59Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.0 (Dec 2021)&lt;/h2&gt;<br /> <br /> The GSEA v4.2.0 release includes a number of improvements and bug fixes, including:<br /> <br /> * Added a Spearman Correlation metric for continuous phenotypes.<br /> * Added a new Absolute Max of Probes collapse mode.<br /> * Added a feature to allow saving the resulting dataset when the Collapse or Remap_Only options are set for a GSEA analysis. If the 'Create GCT files' option under Advanced Fields is set to ''true'', the dataset will be saved as a GCT in the ''edb'' sub-folder of the analysis result directory.<br /> * Modified to save the console log to a 'gsea.log' file in 'gsea_home'.<br /> * Updated to Log4J 2.16.0. Note however, we do not believe any version of GSEA Desktop is impacted by the vulnerability of earlier Log4j versions because it is a desktop application and does not expose any input forms to users over the web. '''If you are exposing GSEA through a website or other networked server then we recommend you update to 4.2.0 immediately.'''<br /> <br /> There are also updates for better handling of missing values in the input datasets in the file parsers and computations. GSEA ignores missing values in general but there were certain situations where this was not the case. 
These happened primarily around missing tab fields and explicit NA or NaN input values, but there were also improvements to the handling of missing values overall.<br /> * Added more prominent warnings in the logs, the UI, and the reports when there are missing values in the input.<br /> * Modified the GCT, TXT, RNK, and PCL parsers to better handle these cases. NA values were formerly not treated as missing and would cause a numeric parsing error. Likewise for quoted empty values. These are now treated simply as missing values and ignored.<br /> * Fixed bugs in most metric calculations where the missing values were not ignored as intended. This affected all metrics except signal-to-noise (S2N, the default) and tTest.<br /> * Fixed the collapse calculations to also ignore missing values among the individual probes in the same way as the metric computations. This can affect the calculation of mean or median, for example.<br /> <br /> Likewise, there are also updates to provide warnings about explicit infinite values in the input dataset. '''Such values can cause unexpected results during computation or plotting and are not recommended'''. Infinite values in the input will, however, be handled and used as-is in the metric computations.<br /> <br /> Infinite values '''coming out of''' the metric computations will be adjusted to 0.01 when using the various &quot;weighted&quot; scoring modes, to avoid interfering with the rest of the enrichment results and any subsequent reporting. This has the effect of ''de-emphasizing'' that particular gene in any scoring. <br /> <br /> This adjustment has historically been applied to the &quot;weighted&quot; scoring mode but was not previously documented; it has been extended to the &quot;weighted_p1.5&quot; and &quot;weighted_p2&quot; modes. 
It is not applied to the Classic K-S scoring mode since the expression values are not directly used with this mode.<br /> <br /> This adjustment is also made to infinite values during plotting to avoid errors from the charting library being unable to render such values.<br /> <br /> Warnings are also provided for NaN values coming out of metric computations (resulting from division-by-zero or taking the root of a negative value).<br /> <br /> The vast majority of datasets should be unaffected by these changes as such values should be relatively rare. If you have run analyses on datasets with missing, NA, NaN, or Infinite values and are concerned about changes to the results, we recommend re-running the analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> <br /> Beyond that, there are a number of miscellaneous improvements and bug fixes. Chief among these are:<br /> * Fixed a bug in the calculation of the weighted_p1.5 scoring mode. If you have used this mode in the past, we recommend re-running your analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> * Changed the FDR q-value scale on the NES vs Significance plot. This was formerly 0-100 but has been changed to 0.0-1.0 to match the values in the report table.<br /> * Added minimum-sample warnings and errors for the continuous phenotype metrics. Fixed a bug where the minimum-sample check was not applied with gene_set permutation mode.<br /> * Added a warning about use of the FDR when only one gene set is being analyzed. 
Reported FDRs are not an accurate representation of the actual false discovery rate when derived from a single gene set.<br /> * Modified the launcher scripts to fix some issues with recent Java 11 releases on newer versions of macOS and to better support symlinks on Mac and Linux.<br /> * Fixed bugs with GMT caching and the gene set subset-select feature on Windows.<br /> * Fixed a bug with some UI parameter widgets handling empty values.<br /> * Fixed a bug where the analysis RPT file was not saved if there was an error.<br /> * Fixed a bug with GMT &amp; CHIP sorting for MSigDB point releases.<br /> * Fixed some issues with blank fields in the CHIP parser.<br /> * Fixed some bugs in the GCT &amp; TXT export functions.<br /> * Improved the error message for a missing phenotype selection.<br /> * Updated the CHIP Download link in the Help menu to use our new location.<br /> * Fixed a UI dialog-centering bug.<br /> * Added GSEA &amp; MSigDB citation info to the report.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.2.x_Release_Notes&diff=4461 GSEA v4.2.x Release Notes 2021-12-16T01:13:20Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.0 (Dec 2021)&lt;/h2&gt;<br /> <br /> The GSEA v4.2.0 release includes a number of improvements and bug fixes, including:<br /> <br /> * Added a Spearman Correlation metric for continuous phenotypes.<br /> * Added a new Absolute Max of Probes collapse mode.<br /> * Added a feature to allow saving the resulting dataset when the Collapse or Remap_Only options are set for a GSEA 
analysis. If the 'Create GCT files' option under Advanced Fields is set to ''true'', the dataset will be saved as a GCT in the ''edb'' sub-folder of the analysis result directory.<br /> * Modified to save the console log to a 'gsea.log' file in 'gsea_home'.<br /> * Updated to Log4J 2.16.0. Note however, we do not believe any version of GSEA Desktop is impacted by the vulnerability of earlier Log4j versions because it is a desktop application and does not expose any input forms to users over the web. '''If you are exposing GSEA through a website or other networked server then we recommend you update to 4.2.0 immediately.'''<br /> <br /> There are also updates for better handling of missing values in the input datasets in the file parsers and computations. GSEA ignores missing values in general but there were certain situations where this was not the case. These happened primarily around missing tab fields and explicit NA or NaN input values, but there were also improvements to the handling of missing values overall.<br /> * Added more prominent warnings in the logs, the UI, and the reports when there are missing values in the input.<br /> * Modified the GCT, TXT, RNK, and PCL parsers to better handle these cases. NA values were formerly not treated as missing and would cause a numeric parsing error. Likewise for quoted empty values. These are now treated simply as missing values and ignored.<br /> * Fixed bugs in most metric calculations where the missing values were not ignored as intended. This affected all metrics except signal-to-noise (S2N, the default) and tTest.<br /> * Fixed the collapse calculations to also ignore missing values among the individual probes in the same way as the metric computations. This can affect the calculation of mean or median, for example.<br /> <br /> Likewise, there are also updates to provide warnings about explicit infinite values in the input dataset. 
'''Such values can cause unexpected results during computation or plotting and are not recommended'''. Infinite values in the input will, however, be handled and used as-is in the metric computations.<br /> <br /> Infinite values '''coming out of''' the metric computations will be adjusted to 0.01 when using the various &quot;weighted&quot; scoring modes, to avoid interfering with the rest of the enrichment results and any subsequent reporting. This has the effect of ''de-emphasizing'' that particular gene in any scoring. <br /> <br /> This adjustment has historically been applied to the &quot;weighted&quot; scoring mode but was not previously documented; it has been extended to the &quot;weighted_p1.5&quot; and &quot;weighted_p2&quot; modes. It is not applied to the Classic K-S scoring mode since the expression values are not directly used with this mode.<br /> <br /> This adjustment is also made to infinite values during plotting to avoid errors from the charting library being unable to render such values.<br /> <br /> Warnings are also provided for NaN values coming out of metric computations (resulting from division-by-zero or taking the root of a negative value).<br /> <br /> The vast majority of datasets should be unaffected by these changes as such values should be relatively rare. If you have run analyses on datasets with missing, NA, NaN, or Infinite values and are concerned about changes to the results, we recommend re-running the analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> <br /> Beyond that, there are a number of miscellaneous improvements and bug fixes. Chief among these are:<br /> * Changed the FDR q-value scale on the NES vs Significance plot. This was formerly 0-100 but has been changed to 0.0-1.0 to match the values in the report table.<br /> * Added minimum-sample warnings and errors for the continuous phenotype metrics. 
Fixed a bug where the minimum-sample check was not applied with gene_set permutation mode.<br /> * Added a warning about use of the FDR when only one gene set is being analyzed. Reported FDRs are not an accurate representation of the actual false discovery rate when derived from a single gene set.<br /> * Modified the launcher scripts to fix some issues with recent Java 11 releases on newer versions of macOS and to better support symlinks on Mac and Linux.<br /> * Fixed bugs with GMT caching and the gene set subset-select feature on Windows.<br /> * Fixed a bug with some UI parameter widgets handling empty values.<br /> * Fixed a bug where the analysis RPT file was not saved if there was an error.<br /> * Fixed a bug with GMT &amp; CHIP sorting for MSigDB point releases.<br /> * Fixed some issues with blank fields in the CHIP parser.<br /> * Fixed some bugs in the GCT &amp; TXT export functions.<br /> * Improved the error message for a missing phenotype selection.<br /> * Updated the CHIP Download link in the Help menu to use our new location.<br /> * Fixed a UI dialog-centering bug.<br /> * Added GSEA &amp; MSigDB citation info to the report.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.2.x_Release_Notes&diff=4460 GSEA v4.2.x Release Notes 2021-12-06T23:09:07Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.0 (Dec 2021)&lt;/h2&gt;<br /> <br /> The GSEA v4.2.0 release includes a number of improvements and bug fixes, including:<br /> <br /> * Added a Spearman Correlation metric for continuous phenotypes. 
''Need to add a &quot;what and why&quot; explanation ...''<br /> * Added a new Absolute Max of Probes collapse mode. ''Further details to come ...''<br /> * Added a feature to allow saving the resulting dataset when the Collapse or Remap_Only options are set for a GSEA analysis. If the 'Create GCT files' option under Advanced Fields is set to ''true'', the dataset will be saved as a GCT in the ''edb'' sub-folder of the analysis result directory.<br /> * Modified to save the console log to a 'gsea.log' file in 'gsea_home'.<br /> <br /> There are also updates for better handling of missing values in the input datasets in the file parsers and computations. GSEA ignores missing values in general but there were certain situations where this was not the case. These happened primarily around missing tab fields and explicit NA or NaN input values, but there were also improvements to the handling of missing values overall.<br /> * Added more prominent warnings in the logs, the UI, and the reports when there are missing values in the input.<br /> * Modified the GCT, TXT, RNK, and PCL parsers to better handle these cases. NA values were formerly not treated as missing and would cause a numeric parsing error. Likewise for quoted empty values. These are now treated simply as missing values and ignored.<br /> * Fixed bugs in most metric calculations where the missing values were not ignored as intended. This affected all metrics except signal-to-noise (S2N, the default) and tTest.<br /> * Fixed the collapse calculations to also ignore missing values among the individual probes in the same way as the metric computations. This can affect the calculation of mean or median, for example.<br /> <br /> Likewise, there are also updates to provide warnings about explicit infinite values in the input dataset. '''Such values can cause unexpected results during computation or plotting and are not recommended'''. 
Infinite values in the input will, however, be handled and used as-is in the metric computations.<br /> <br /> Infinite values '''coming out of''' the metric computations will be adjusted to 0.01 when using the various &quot;weighted&quot; scoring modes, to avoid interfering with the rest of the enrichment results and any subsequent reporting. This has the effect of ''de-emphasizing'' that particular gene in any scoring. <br /> <br /> This adjustment has historically been applied to the &quot;weighted&quot; scoring mode but was not previously documented; it has been extended to the &quot;weighted_p1.5&quot; and &quot;weighted_p2&quot; modes. It is not applied to the Classic K-S scoring mode since the expression values are not directly used with this mode.<br /> <br /> This adjustment is also made to infinite values during plotting to avoid errors from the charting library being unable to render such values.<br /> <br /> Warnings are also provided for NaN values coming out of metric computations (resulting from division-by-zero or taking the root of a negative value).<br /> <br /> The vast majority of datasets should be unaffected by these changes as such values should be relatively rare. If you have run analyses on datasets with missing, NA, NaN, or Infinite values and are concerned about changes to the results, we recommend re-running the analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> <br /> Beyond that, there are a number of miscellaneous improvements and bug fixes. Chief among these are:<br /> * Changed the FDR q-value scale on the NES vs Significance plot. This was formerly 0-100 but has been changed to 0.0-1.0 to match the values in the report table.<br /> * Added minimum-sample warnings and errors for the continuous phenotype metrics. Fixed a bug where the minimum-sample check was not applied with gene_set permutation mode.<br /> * Added a warning about use of the FDR when only one gene set is being analyzed. 
Reported FDRs are not an accurate representation of the actual false discovery rate when derived from a single gene set.<br /> * Modified the launcher scripts to fix some issues with recent Java 11 releases on newer versions of macOS and to better support symlinks on Mac and Linux.<br /> * Fixed bugs with GMT caching and the gene set subset-select feature on Windows.<br /> * Fixed a bug with some UI parameter widgets handling empty values.<br /> * Fixed a bug where the analysis RPT file was not saved if there was an error.<br /> * Fixed a bug with GMT &amp; CHIP sorting for MSigDB point releases.<br /> * Fixed some issues with blank fields in the CHIP parser.<br /> * Fixed some bugs in the GCT &amp; TXT export functions.<br /> * Improved the error message for a missing phenotype selection.<br /> * Updated the CHIP Download link in the Help menu to use our new location.<br /> * Fixed a UI dialog-centering bug.<br /> * Added GSEA &amp; MSigDB citation info to the report.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.2.x_Release_Notes&diff=4459 GSEA v4.2.x Release Notes 2021-12-06T22:58:18Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.0 (Dec 2021)&lt;/h2&gt;<br /> <br /> The GSEA v4.2.0 release includes a number of improvements and bug fixes, including:<br /> <br /> * Added a Spearman Correlation metric for continuous phenotypes. ''Need to add a &quot;what and why&quot; explanation ...''<br /> * Added a new Absolute Max of Probes collapse mode. 
''Further details to come''<br /> * Added a feature to allow saving the resulting dataset when the Collapse or Remap_Only options are set for a GSEA analysis. If the 'Create GCT files' option under Advanced Fields is set to ''true'', the dataset will be saved as a GCT in the ''edb'' sub-folder of the analysis result directory.<br /> * Modified to save the console log to a 'gsea.log' file in 'gsea_home'.<br /> <br /> There are also updates for better handling of missing values in the input datasets in the file parsers and computations. GSEA ignores missing values in general but there were certain situations where this was not the case. These happened primarily around missing tab fields and explicit NA or NaN input values, but there were also improvements to the handling of missing values overall.<br /> * Added more prominent warnings in the logs, the UI, and the reports when there are missing values in the input.<br /> * Modified the GCT, TXT, RNK, and PCL parsers to better handle these cases. NA values were formerly not treated as missing and would cause a numeric parsing error. Likewise for quoted empty values. These are now treated simply as missing values and ignored.<br /> * Fixed bugs in most metric calculations where the missing values were not ignored as intended. This affected all metrics except signal-to-noise (S2N, the default) and tTest.<br /> * Fixed the collapse calculations to also ignore missing values among the individual probes in the same way as the metric computations. This can affect the calculation of mean or median, for example.<br /> <br /> Likewise, there are also updates to provide warnings about explicit infinite values in the input dataset. '''Such values can cause unexpected results during computation or plotting and are not recommended'''. 
Infinite values in the input will, however, be handled and used as-is in the metric computations.<br /> <br /> Infinite values '''coming out of''' the metric computations will be adjusted to 0.01 when using the various &quot;weighted&quot; scoring modes, to avoid interfering with the rest of the enrichment results and any subsequent reporting. This has the effect of ''de-emphasizing'' that particular gene in any scoring. <br /> <br /> This adjustment has historically been applied to the &quot;weighted&quot; scoring mode but was not previously documented; it has been extended to the &quot;weighted_p1.5&quot; and &quot;weighted_p2&quot; modes. It is not applied to the Classic K-S scoring mode since the expression values are not directly used with this mode.<br /> <br /> This adjustment is also made to infinite values during plotting to avoid errors from the charting library being unable to render such values.<br /> <br /> Warnings are also provided for NaN values coming out of metric computations (resulting from division-by-zero or taking the root of a negative value).<br /> <br /> The vast majority of datasets should be unaffected by these changes as such values should be relatively rare. If you have run analyses on datasets with missing, NA, NaN, or Infinite values and are concerned about changes to the results, we recommend re-running the analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> <br /> Beyond that, there are a number of miscellaneous improvements and bug fixes. Chief among these are:<br /> * Changed the FDR q-value scale on the NES vs Significance plot. This was formerly 0-100 but has been changed to 0.0-1.0 to match the values in the report table.<br /> * Added minimum-sample warnings and errors for the continuous phenotype metrics. Fixed a bug where the minimum-sample check was not applied with gene_set permutation mode.<br /> * Added a warning about use of the FDR when only one gene set is being analyzed. 
Reported FDRs are not an accurate representation of the actual false discovery rate when derived from a single gene set.<br /> * Modified the launcher scripts to fix some issues with recent Java 11 releases on newer versions of macOS and to better support symlinks on Mac and Linux.<br /> * Fixed bugs with GMT caching and the gene set subset-select feature on Windows.<br /> * Fixed a bug with some UI parameter widgets handling empty values.<br /> * Fixed a bug where the analysis RPT file was not saved if there was an error.<br /> * Fixed a bug with GMT &amp; CHIP sorting for MSigDB point releases.<br /> * Fixed some issues with blank fields in the CHIP parser.<br /> * Fixed some bugs in the GCT &amp; TXT export functions.<br /> * Improved the error message for a missing phenotype selection.<br /> * Updated the CHIP Download link in the Help menu to use our new location.<br /> * Fixed a UI dialog-centering bug.<br /> * Added GSEA &amp; MSigDB citation info to the report.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.2.x_Release_Notes&diff=4458 GSEA v4.2.x Release Notes 2021-12-06T22:57:01Z <p>Eby: Created page with '[http://www.broadinstitute.org/gsea/ GSEA Home] | [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures…'</p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.2.0 (Dec 2021)&lt;/h2&gt;<br /> <br /> The GSEA v4.2.0 release includes a number of improvements and bug fixes, including:<br /> <br /> * Added a Spearman Correlation metric for 
continuous phenotypes. ''Need to add a &quot;what and why&quot; explanation ...''<br /> * Added a new Absolute Max of Probes collapse mode. ''Further details to come''<br /> * Added a feature to allow saving the resulting dataset when the Collapse or Remap_Only options are set for a GSEA analysis. If the 'Create GCT files' option under Advanced Fields is set to ''true'', the dataset will be saved as a GCT in the ''edb'' sub-folder of the analysis result directory.<br /> * Modified to save the console log to a `gsea.log` file in `gsea_home`.<br /> <br /> There are also updates for better handling of missing values in the input datasets in the file parsers and computations. GSEA ignores missing values in general but there were certain situations where this was not the case. These happened primarily around missing tab fields and explicit NA or NaN input values, but there were also improvements to the handling of missing values overall.<br /> * Added more prominent warnings in the logs, the UI, and the reports when there are missing values in the input.<br /> * Modified the GCT, TXT, RNK, and PCL parsers to better handle these cases. NA values were formerly not treated as missing and would cause a numeric parsing error. Likewise for quoted empty values. These are now treated simply as missing values and ignored.<br /> * Fixed bugs in most metric calculations where the missing values were not ignored as intended. This affected all metrics except signal-to-noise (S2N, the default) and tTest.<br /> * Fixed the collapse calculations to also ignore missing values among the individual probes in the same way as the metric computations. This can affect the calculation of mean or median, for example.<br /> <br /> Likewise, there are also updates to provide warnings about explicit infinite values in the input dataset. '''Such values can cause unexpected results during computation or plotting and are not recommended'''. 
Infinite values in the input will, however, be handled and used as-is in the metric computations.<br /> <br /> Infinite values '''coming out of''' the metric computations will be adjusted to 0.01 when using the various &quot;weighted&quot; scoring modes, to avoid interfering with the rest of the enrichment results and any subsequent reporting. This has the effect of ''de-emphasizing'' that particular gene in any scoring. <br /> <br /> This adjustment has historically been applied to the &quot;weighted&quot; scoring mode but was not previously documented; it has been extended to the &quot;weighted_p1.5&quot; and &quot;weighted_p2&quot; modes. It is not applied to the Classic K-S scoring mode since the expression values are not directly used with this mode.<br /> <br /> This adjustment is also made to infinite values during plotting to avoid errors from the charting library being unable to render such values.<br /> <br /> Warnings are also provided for NaN values coming out of metric computations (resulting from division-by-zero or taking the root of a negative value).<br /> <br /> The vast majority of datasets should be unaffected by these changes as such values should be relatively rare. If you have run analyses on datasets with missing, NA, NaN, or Infinite values and are concerned about changes to the results, we recommend re-running the analysis with GSEA 4.2.0 to evaluate the possible differences.<br /> <br /> Beyond that, there are a number of miscellaneous improvements and bug fixes. Chief among these are:<br /> * Changed the FDR q-value scale on the NES vs Significance plot. This was formerly 0-100 but has been changed to 0.0-1.0 to match the values in the report table.<br /> * Added minimum-sample warnings and errors for the continuous phenotype metrics. Fixed a bug where the minimum-sample check was not applied with gene_set permutation mode.<br /> * Added a warning about use of the FDR when only one gene set is being analyzed. 
Reported FDRs are not an accurate representation of the actual false discovery rate when derived from a single gene set.<br /> * Modified the launcher scripts to fix some issues with recent Java 11 releases on newer versions of macOS and to better support symlinks on Mac and Linux.<br /> * Fixed bugs with GMT caching and the gene set subset-select feature on Windows.<br /> * Fixed a bug with some UI parameter widgets handling empty values.<br /> * Fixed a bug where the analysis RPT file was not saved if there was an error.<br /> * Fixed a bug with GMT &amp; CHIP sorting for MSigDB point releases.<br /> * Fixed some issues with blank fields in the CHIP parser.<br /> * Fixed some bugs in the GCT &amp; TXT export functions.<br /> * Improved the error message for a missing phenotype selection.<br /> * Updated the CHIP Download link in the Help menu to use our new location.<br /> * Fixed a UI dialog-centering bug.<br /> * Added GSEA &amp; MSigDB citation info to the report.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=Main_Page&diff=4456 Main Page 2021-08-30T23:38:18Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> &lt;p&gt; Use the navigation bar on the left to display documentation on GSEA software, MSigDB database or GSEA/MSigDB web site. 
If you have comments or questions not answered by the [[FAQ]] or the [http://www.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html User Guide], contact us at [https://groups.google.com/group/gsea-help groups.google.com/group/gsea-help].&lt;/p&gt;<br /> <br /> &lt;ul&gt; When contacting our team with questions about Java GSEA programs, please send the following information:<br /> &lt;li&gt; your computer's operating system<br /> &lt;li&gt; version of Java which you used to run GSEA<br /> &lt;li&gt; detailed log transcript from the GSEA session in question<br /> &lt;p&gt; to view the log, click [+] at the bottom of main screen of GSEA Java desktop application, copy the text to a separate file and attach it to your request &lt;/p&gt;<br /> &lt;/ul&gt;<br /> <br /> &lt;h2&gt;Where to start&lt;/h2&gt;<br /> &lt;p&gt; If you are new to GSEA, see the [http://www.broadinstitute.org/gsea/doc/desktop_tutorial.jsp Tutorial] for a brief overview of the software. <br /> If you have a question, see the [[FAQ]] or the [http://www.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html User Guide]. The User Guide describes how to prepare data files, load data files, run the gene set enrichment analysis, and interpret the results. It also includes instructions for running GSEA from the command line and a Quick Reference section, which describes each window of the GSEA desktop application. &lt;br /&gt;<br /> &lt;/p&gt;<br /> &lt;h3&gt;Getting started with RNA-seq and GSEA&lt;/h3&gt;<br /> The GSEA method was originally developed for analysis of microarray data. In order to best adapt this method for RNA-sequencing data sets the GSEA team has developed a [[Using_RNA-seq_Datasets_with_GSEA|collection of guidelines and suggestions which describe how to properly handle these data.]]<br /> &lt;h2&gt;MSigDB gene sets&lt;/h2&gt;<br /> &lt;p&gt; Current release of the Molecular Signatures Database ([[MSigDB_v7.4_Release_Notes|v7.4 MSigDB]]) contains 32,284 gene sets for use with GSEA. 
For information about MSigDB and the gene sets, see the [http://www.broadinstitute.org/gsea/msigdb MSigDB web site]. &lt;/p&gt;<br /> &lt;p&gt; Please note that gene sets can change or become deprecated in subsequent releases of MSigDB. It is thus important to indicate the version of MSigDB to fully reference gene sets used in your study. &lt;/p&gt;<br /> <br /> &lt;h2&gt;Software&lt;/h2&gt;<br /> &lt;p&gt;We provide the following software implementations of the GSEA method:<br /> &lt;ul&gt;<br /> &lt;li&gt;Java desktop application -- Easy-to-use graphical interface that can be run from the [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] page. The [http://www.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html User Guide] fully describes this application in detail.<br /> &lt;/li&gt;<br /> &lt;li&gt;Java jar file -- Command line interface that can be downloaded from the [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] page. See [http://software.broadinstitute.org/gsea/doc/GSEAUserGuideTEXT.htm#_Running_GSEA_from Running GSEA from the Command Line] in the &lt;i&gt;User Guide&lt;/i&gt; for details. This might be useful for analyzing several datasets sequentially, analyzing large datasets, or running analyses on a compute cluster.&lt;/li&gt;<br /> &lt;li&gt;R-GSEA -- R implementation of GSEA that can be downloaded from the [http://www.broadinstitute.org/gsea/downloads_archive.jsp Archived Downloads] page. This implementation is intended for experienced computational biologists who may want to explore the underlying algorithm. The [[R-GSEA_Readme|R-GSEA Readme]] provides brief instructions and support is limited. 
Please note that this implementation is not actively maintained or supported.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> &lt;/p&gt;<br /> &lt;p&gt;Thank you for your interest in GSEA,&lt;br&gt;<br /> The GSEA Team&lt;/p&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=MSigDB_v7.3_Release_Notes&diff=4439 MSigDB v7.3 Release Notes 2021-03-11T21:00:44Z <p>Eby: </p> <hr /> <div>&lt;span class=&quot;plainlinks&quot;&gt;<br /> [http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;/span&gt;<br /> <br /> This page describes the changes made to the gene set collections for Release 7.3 of the Molecular Signatures Database (MSigDB). This release includes a reorganization of C7 to accommodate the addition of vaccination response gene sets provided by [https://www.immuneprofiling.org/hipc/page/show the Human Immunology Project Consortium] among other minor updates and additions.<br /> <br /> &lt;b&gt;Note:&lt;/b&gt; Due to substantial changes introduced in MSigDB 7.0, using GSEA 4.0.0+ is recommended when utilizing MSigDB 7.0+ resources.&lt;br&gt;<br /> &lt;b&gt;Advisory&lt;/b&gt;: It is strongly recommended that users of MSigDB 7.3 '''always''' use the GSEA &quot;Collapse/Remap to gene symbols&quot; feature with the provided Symbol Remapping chip file if your dataset was generated with a transcriptome other than '''Ensembl v103/GENCODE v37'''.<br /> <br /> &lt;h2&gt;New Additions and Changes to Collection Organization&lt;/h2&gt;<br /> <br /> &lt;h3&gt;C2:CGP&lt;/h3&gt;<br /> Gene sets describing the molecular effect of over expression of S1PR3 in Leukemia [https://pubmed.ncbi.nlm.nih.gov/33458693/ (PMID33458693)], and 
signatures describing the effects of anti-TNF therapy on inflammatory bowel disease [https://pubmed.ncbi.nlm.nih.gov/33429950/ (PMID33429950)] as well as gene sets contributed by the following individuals have been added to C2:CGP<br /> &lt;ul&gt;<br /> &lt;li&gt;Jorge Benitez, University of California, San Diego - BENITEZ_GBM_PROTEASOME_INHIBITION_RESPONSE Signature, [https://pubmed.ncbi.nlm.nih.gov/33428749/ (PMID33428749)]<br /> &lt;li&gt;Martin Fischer, Leibniz Institute on Aging, Fritz Lipmann Institute - RIEGE_DELTANP63_DIRECT_TARGETS_UP Signature, [https://pubmed.ncbi.nlm.nih.gov/33263276/ (PMID33263276)]<br /> &lt;/ul&gt;<br /> <br /> &lt;h3&gt;C7: immunologic signature gene sets&lt;/h3&gt;<br /> &lt;ul&gt;<br /> &lt;li&gt;C7 has been reorganized to accommodate the addition of new data. The previous C7 collection has been moved to sub-collection level and renamed to C7:ImmuneSigDB, to reflect its original publication title.<br /> &lt;li&gt; A new sub-collection, C7:VAX has been added to C7. This sub-collection consists of 347 gene sets curated from the literature by [https://www.immuneprofiling.org/hipc/page/show the Human Immunology Project Consortium (HIPC)]. These sets describe the human immunological responses to specific vaccines. Sets in this collection include signatures of age specific responses, post-vaccination response time-courses, and predictive signatures of responders to vaccination vs. non-responders among other curated data.<br /> &lt;/ul&gt;<br /> <br /> &lt;h3&gt;C8: cell type signature gene sets&lt;/h3&gt;<br /> <br /> 333 Gene sets of single-cell sequencing derived cell identity signatures have been added to C8. These consist of:<br /> &lt;ul&gt;<br /> &lt;li&gt;19 gene sets of Ovarian derived cell types from [https://pubmed.ncbi.nlm.nih.gov/31320652/ Fan et al. (PMID31320652)]<br /> &lt;li&gt;54 gene sets of Lung derived cell types from [https://pubmed.ncbi.nlm.nih.gov/33208946 Travaglini et al. 
(PMID33208946)]<br /> &lt;li&gt;11 gene sets of Skeletal Muscle derived cell types from [https://pubmed.ncbi.nlm.nih.gov/31937892 Rubenstein et al. (PMID31937892)]<br /> &lt;li&gt; 77 Global, and 172 Tissue specific cell types from the [https://descartes.brotmanbaty.org Descartes database] Human Gene Expression During Development atlas [https://pubmed.ncbi.nlm.nih.gov/33184181 (Cao et al. PMID33184181)]. <br /> &lt;/ul&gt;<br /> <br /> &lt;h3&gt;Redundant Terms Annotations&lt;/h3&gt;<br /> Gene set sub-collections updated in this release that have undergone redundancy filtering for inclusion in MSigDB now have an additional field on the gene set page, &quot;Redundant Terms&quot;. This field contains the source database IDs of other candidate gene sets that clustered with the selected set and exhibited a Jaccard coefficient &gt;0.85 with the selected set but were not selected on the basis of tree distance or set size. These database IDs link to the source resource's page for that term as in the EXTERNAL_DETAILS_URL field.<br /> <br /> &lt;h2&gt;Updates to Existing Gene Sets by Collection&lt;/h2&gt;<br /> <br /> &lt;h3&gt;C1 (positional gene sets)&lt;/h3&gt;<br /> C1 has been updated to reflect the primary assembly of the current release of the Human Genome as present in Ensembl 103 and GENCODE 37 (GRCh38). 
Gene annotations for this collection are derived from the ''Chromosome'' and ''Karyotype band'' tracks from the Ensembl BioMart (version 103) and reflect the gene architecture as represented on the primary assembly.<br /> <br /> &lt;h3&gt;C2:CP:Reactome&lt;/h3&gt;<br /> &lt;ul&gt;<br /> &lt;li&gt;Reactome gene sets have been updated to reflect the state of the Reactome pathway architecture as of '''Reactome v75''' (+15 gene sets).<br /> &lt;li&gt;As previously described in the [[MSigDB_v7.0_Release_Notes#C2:CP:Reactome_-_Major_overhaul | Reactome release notes for MSigDB 7.0]], in order to limit redundancy between gene sets within the Reactome sub-collection we applied a filtering procedure based on Jaccard coefficients and distance from the top level of the Reactome event hierarchy.<br /> &lt;/ul&gt;<br /> <br /> &lt;h3&gt;C2:CP:WikiPathways&lt;/h3&gt;<br /> WikiPathways gene sets have been updated to reflect the state of WikiPathways Release 20210310 (+28 gene sets).<br /> &lt;h3&gt;C3 regulatory target gene sets&lt;/h3&gt;<br /> <br /> C3:GTRD has been updated to GTRD v20.06 (+175 gene sets). This update also corrects an error where data from certain transcription factors with short promoter regions may have been omitted.<br /> <br /> &lt;h3&gt;C5:GO (Gene Ontology)&lt;/h3&gt;<br /> &lt;p&gt; Gene sets in these sub-collections are derived from the controlled vocabulary of the Gene Ontology (GO) project: The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology (&lt;span class=&quot;plainlinks&quot;&gt;[http://www.geneontology.org Nature Genet 2000]&lt;/span&gt;). The gene sets are named by GO term and contain genes annotated by that term. 
This collection has been updated to the most recent GO annotations as present in the GO-basic obo file released on 2021-02-01 and NCBI gene2go annotations downloaded on 2021-02-16.&lt;/p&gt;<br /> &lt;p&gt;This collection is divided into three sub-collections:&lt;/p&gt;<br /> &lt;ul&gt;<br /> &lt;li&gt;&lt;strong&gt;BP&lt;/strong&gt;: GO Biological process (-94 gene sets). Gene sets derived from the Biological Process Ontology.&lt;/li&gt;<br /> &lt;li&gt;&lt;strong&gt;CC&lt;/strong&gt;: GO Cellular component (-5 gene sets). Gene sets derived from the Cellular Component Ontology.&lt;/li&gt;<br /> &lt;li&gt;&lt;strong&gt;MF&lt;/strong&gt;: GO Molecular function (+11 gene sets). Gene sets derived from the Molecular Function Ontology.&lt;/li&gt;<br /> &lt;/ul&gt;<br /> <br /> Gene sets in the GO sub-collections previously had the universal prefix &quot;GO_&quot;; this prefix has been updated to be sub-collection specific. Gene sets in GO:BP now begin with &quot;GOBP_&quot;, those in GO:CC now begin with &quot;GOCC_&quot;, and those in GO:MF now begin with &quot;GOMF_&quot;. This change should enable better &quot;at a glance&quot; determinations of which GO sub-collection was the origin of a specific gene set hit in analysis pipelines.<br /> <br /> &lt;p&gt;These updates were generated in accordance with the procedure described in the [[MSigDB_v7.0_Release_Notes#C5_.28Gene_Ontology_collection.29_-_Major_overhaul | GO release notes for MSigDB 7.0.]]<br /> <br /> &lt;h3&gt;C5:HPO (Human Phenotype Ontology)&lt;/h3&gt;<br /> <br /> Gene sets in this sub-collection have been updated to reflect the 2021-02-09 release of the Human Phenotype Ontology database (+319 gene sets). 
This sub-collection has been redundancy filtered through a procedure comparable to that of the GO and Reactome sub-collections.<br /> <br /> &lt;h3&gt;CHIP file updates&lt;/h3&gt;<br /> <br /> All CHIP files previously provided in the standard MSigDB 7.2 release have been updated for MSigDB 7.3 in accordance with previously described procedures.<br /> <br /> Gene orthology annotations for mapping mouse and rat genes to their best match human orthologs have been updated to &lt;span class=&quot;plainlinks&quot;&gt;[https://www.alliancegenome.org/ Alliance of Genome Resources] orthology database release 3.2.</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=Release_Notes&diff=4402 Release Notes 2020-07-30T21:15:31Z <p>Eby: </p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp Contact]<br /> &lt;br&gt;<br /> <br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;GSEA Software Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table width=&quot;700&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; height=&quot;78&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jul 2020&lt;/td&gt;<br /> &lt;td&gt;4.1.&lt;em&gt;x&lt;/em&gt;*&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Improved support for macOS Catalina, updated and improved Enrichment Reports, and numerous bug 
fixes&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.1.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Aug 2019 - Nov 2019&lt;/td&gt;<br /> &lt;td&gt;4.0.&lt;em&gt;x&lt;/em&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Updates for MSigDB 7.0, Java 11 compatibility, and better performance&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v4.0.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jul 2017&lt;/td&gt;<br /> &lt;td&gt;3.0&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;Open source release, with numerous improvements and bug fixes&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v3.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2015 - Apr 2017&lt;/td&gt;<br /> &lt;td&gt;2.2.x&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.2.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jun 2014&lt;/td&gt;<br /> &lt;td&gt;2.1.0&lt;/td&gt;<br /> &lt;td&gt;Added Enrichment Map visualization of GSEA results&lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.1.0._Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jan 2007 - Jan 2014&lt;/td&gt;<br /> &lt;td&gt;2.0.x&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[GSEA_v2.0.x_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2005&lt;/td&gt;<br /> &lt;td&gt;1.0&lt;/td&gt;<br /> &lt;td&gt;Initial release&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> <br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;MSigDB Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table height=&quot;83&quot; width=&quot;637&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> 
&lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2020&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.1*&lt;/td&gt;<br /> &lt;td&gt;C2 (+28); C3 (+2904); C5 (+196)&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Aug 2019&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;7.0&lt;/td&gt;<br /> &lt;td&gt;C1 (-27); C2 (+738); C5 (+4079)&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v7.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jul 2018&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.2&lt;/td&gt;<br /> &lt;td&gt;C2 (+24)&lt;br&gt;<br /> [[Mapping_between_v6.2_and_v6.1_gene_sets|Mapping between v6.2 and v6.1 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.2_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2017&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+7)&lt;br&gt;<br /> [[Mapping_between_v6.1_and_v6.0_gene_sets|Mapping between v6.1 and v6.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Apr 2017&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;6.0&lt;/td&gt;<br /> &lt;td&gt;C2 (+2); C5 (-249)&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v6.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt; <br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2016&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.2&lt;/td&gt;<br /> &lt;td&gt;C2 (+4); C5 (+4,712)&lt;br&gt;<br /> [[Mapping_between_v5.2_and_v5.1_gene_sets|Mapping between v5.2 and v5.1 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v5.2_Release_Notes|wiki]]&lt;/td&gt;<br 
/> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jan 2016&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+1); C7 (+2,962)&lt;/td&gt;<br /> &lt;td&gt; [[MSigDB_v5.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Mar 2015&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;5.0&lt;/td&gt;<br /> &lt;td&gt;H (+50); C2 (+3)&lt;br&gt;<br /> [[Mapping_between_v5.0_and_v4.0_gene_sets|Mapping between v5.0 and v4.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v5.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;May 2013&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;4.0&lt;/td&gt;<br /> &lt;td&gt;C2 (-128); C7 (+1,910)&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v4.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Oct 2012&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;3.1&lt;/td&gt;<br /> &lt;td&gt;C2 (+1,578); C4 (-23); C6 (+189)&lt;br&gt;<br /> [[Mapping_between_v3.1_and_v3.0_gene_sets|Mapping between v3.0 and v3.1 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v3.1_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Sept 2010&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;3.0&lt;/td&gt;<br /> &lt;td&gt;C1 (-60); C2 (+1,380); C3 (-1); C4 (-2) &lt;br&gt;<br /> [[Msigdb_mapping_v2.5_to_v3|Mapping between v2.5 and v3.0 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v3.0_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;April 2008&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.5&lt;/td&gt;<br /> &lt;td&gt; C2 (+205); C4 (+456); C5 (+1454) &lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[[MSigDB_v2.5_Release_Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Feb 2007&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.1&lt;/td&gt;<br /> &lt;td&gt;Minor updates to MSigDB v2.0 annotations 
&lt;/td&gt;<br /> &lt;td&gt;&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Jan 2007&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;2.0&lt;/td&gt;<br /> &lt;td&gt;C1 (updated); C2 (+269); C3 (+214) &lt;br /&gt;<br /> [[Msigdb_mapping_v1_to_v2|Mapping between v1 and v2 gene sets]]&lt;/td&gt;<br /> &lt;td&gt;[[Msigdb_may_2006_release_notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;Nov 2005&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;1.1&lt;/td&gt;<br /> &lt;td&gt;C1 (updated); C2 (+350); C3 (+566); C4&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt;[http://www.broadinstitute.org/gsea/doc/msigdb_nov_2005_release_notes.pdf pdf]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr valign=&quot;top&quot;&gt;<br /> &lt;td&gt;March 2005&lt;/td&gt;<br /> &lt;td&gt;&amp;nbsp;1.0&lt;/td&gt;<br /> &lt;td&gt;Initial release &lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;td&gt; [http://www.broadinstitute.org/gsea/doc/msigdb_march_2005_release_notes.pdf pdf]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> &lt;h3&gt;&lt;font color=&quot;#3366ff&quot;&gt;Web site Release Notes&lt;/font&gt;&lt;/h3&gt;<br /> &lt;table height=&quot;78&quot; width=&quot;639&quot; cellspacing=&quot;1&quot; cellpadding=&quot;1&quot; border=&quot;0&quot; align=&quot;&quot; summary=&quot;&quot;&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;&lt;strong&gt;Date&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;<br /> &lt;td&gt;&lt;strong&gt;Release Notes&lt;/strong&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Aug 2019&lt;/td&gt;<br /> &lt;td&gt;6.4*&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v7.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v6.4 Release Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Apr 2017&lt;/td&gt;<br /> &lt;td&gt;6.0&lt;/td&gt;<br /> &lt;td&gt;Upgrades to 
support v6.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Mar 2015&lt;/td&gt;<br /> &lt;td&gt;5.0&lt;/td&gt;<br /> &lt;td&gt;Upgrades to support v5.0 MSigDB.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jun 2014&lt;/td&gt;<br /> &lt;td&gt;4.05&lt;/td&gt;<br /> &lt;td&gt;Several minor updates&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Apr 2013&lt;/td&gt;<br /> &lt;td&gt;3.87&lt;/td&gt;<br /> &lt;td&gt;Several bug fixes and new functionality.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v3.87 Release Notes|wiki]]&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Oct 2012&lt;/td&gt;<br /> &lt;td&gt;3.84&lt;/td&gt;<br /> &lt;td&gt;Several updates and new functionality.&lt;/td&gt;<br /> &lt;td&gt;&lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;tr&gt;<br /> &lt;td&gt;Jan 2011&lt;/td&gt;<br /> &lt;td&gt;3.5&lt;/td&gt;<br /> &lt;td&gt;Several bug fixes and some new functionality.&lt;/td&gt;<br /> &lt;td&gt;[[Web site v3.4 Release Notes|wiki]]&lt;br /&gt;<br /> &lt;/td&gt;<br /> &lt;/tr&gt;<br /> &lt;/table&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;hr&gt;<br /> * Current release &lt;br /&gt;</div> Eby https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php?title=GSEA_v4.1.x_Release_Notes&diff=4401 GSEA v4.1.x Release Notes 2020-07-30T19:51:07Z <p>Eby: Created page with '[http://www.broadinstitute.org/gsea/ GSEA Home] | [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures…'</p> <hr /> <div>[http://www.broadinstitute.org/gsea/ GSEA Home] |<br /> [http://www.broadinstitute.org/gsea/downloads.jsp Downloads] | <br /> [http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] | <br /> [http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |<br /> [http://www.broadinstitute.org/gsea/contact.jsp 
Contact]&lt;br /&gt;<br /> <br /> &lt;br /&gt;<br /> &lt;h2&gt; GSEA Desktop v4.1.0 (Jul 2020)&lt;/h2&gt;<br /> <br /> The GSEA v4.1.0 release includes a number of improvements and bug fixes, including:<br /> <br /> * Improved support for macOS Catalina and above, including updates to file choosers and notarization of the Mac application with Apple. This fixes issues with installation and with access to certain user directories considered restricted by Apple (Documents, Desktop, Downloads, network drives, and removable media). The file chooser updates apply to Windows and Linux as well.<br /> * Updated the Enrichment Report to be more reflective of modern use with RNA-Seq data and newer versions of MSigDB (7.0+). <br /> * Added the Preranked analysis for annotated reports when Collapse Dataset is used with an annotated CHIP file (in either &lt;em&gt;Collapse&lt;/em&gt; or &lt;em&gt;Remap_Only&lt;/em&gt; mode).<br /> * Added a Symbol Mapping Report to the GSEA and Preranked analyses when Collapse Dataset is used (in either &lt;em&gt;Collapse&lt;/em&gt; or &lt;em&gt;Remap_Only&lt;/em&gt; mode).<br /> * Added a Purge Selected Files feature to the Recent Files list on the Load Data screen, accessible through the right-click / control-click context menu.<br /> * Switched all Excel files in the reports to TSV (tab-separated values) files. 
This addresses an issue with newer versions of Excel complaining about the format and security of the report files: the files previously produced were actually just TSVs with a &quot;.xls&quot; extension, causing Excel to complain about corrupted files.<br /> <br /> Miscellaneous bug fixes:<br /> * Fixed a possible floating-point rounding bug in the &lt;em&gt;mu&lt;/em&gt; and &lt;em&gt;sigma&lt;/em&gt; adjustment of the standard-deviation calculation used for the Signal2Noise and tTest metrics.<br /> * Changed the Recent Files list on the Load Data screen to track least-recently-used files for removal when the file limit is reached.<br /> * Restored the -param_file command-line option, which had been inadvertently removed in the 4.0.0 release.<br /> * Fixed an issue with the Save Dataset menu option in the Leading Edge report.<br /> * Fixed an issue with the Exit Handler on macOS.<br /> * Fixed an issue with the application launcher on certain versions of Windows 10.<br /> * Minor UI tweaks to cope with HiDPI screens, macOS Dark Mode (note: &lt;strong&gt;not&lt;/strong&gt; supported), screen layout, etc.<br /> * Numerous points of internal code simplification and modernization, removal of unused code, etc.</div> Eby