Difference between revisions of "MSigDB SQLite Database"

From GeneSetEnrichmentAnalysisWiki
Jump to navigation Jump to search
Line 42: Line 42:
 
</p>
 
</p>
 
<p>
 
<p>
 +
The core (blue) tables:
 +
<ul>
 +
  <li>The gene_set table holds the core information about each gene set.  Note that the collection_name and license_code columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.
 +
    <ul><li>The tags column is unused at present and reserved for future use.  It may be removed in the future in favor of a more structured alternative for providing tag metadata.</li></ul>
 +
  </li>
 +
  <li>The gene_symbol table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID.  The namespace_id will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.</li>
 +
  <li>The gene_set_gene_symbol table joins the gene sets to its member gene symbols.</li>
 +
  <li>In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, all gene sets include the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The source_member table contains these original gene set member identifiers (joined via gene_set_source_member).
 +
    <ul>
 +
      <li>The gene_symbol_id column gives the mapping to our uniformly mapped gene symbols.  We provide a set of external CHIP files encoding the same information which will usually be more convenient to use, however.</li>
 +
      <li>These tables '''should not''' be used when using the database to extract gene sets for custom gene set files for use with GSEA and other analysis tools  as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces.  These tables are meant for informational purposes only.</li>
 +
    </ul>
 +
  </li>
 +
</ul>
 +
</p>

Revision as of 21:24, 23 March 2023

GSEA Home | Downloads | Molecular Signatures Database | Documentation | Contact

Introduction

With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.

Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.

Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.

The license terms for MSigDB are available here: http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp

Database Design

Design Considerations

The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.

Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only ONE MSigDB release for ONE resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.

This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.

Schema

Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.

Msigdb release.png

Note that in all cases of tables with an id primary key column, these primary key values are generated synthetically and will not be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the id of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.

The core (blue) tables:

  • The gene_set table holds the core information about each gene set. Note that the collection_name and license_code columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.
    • The tags column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.
  • The gene_symbol table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. The namespace_id will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.
  • The gene_set_gene_symbol table joins the gene sets to its member gene symbols.
  • In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, all gene sets include the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The source_member table contains these original gene set member identifiers (joined via gene_set_source_member).
    • The gene_symbol_id column gives the mapping to our uniformly mapped gene symbols. We provide a set of external CHIP files encoding the same information which will usually be more convenient to use, however.
    • These tables should not be used when using the database to extract gene sets for custom gene set files for use with GSEA and other analysis tools as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.