MSigDB SQLite Database

From GeneSetEnrichmentAnalysisWiki

Revision as of 21:52, 23 March 2023


Introduction

With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.

Note that we will continue producing the XML file for now, but it should be considered deprecated; we intend to remove it entirely in a future release.

Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.

The License Terms for MSigDB are available on our website.

Database Design

Design Considerations

The schema is designed to be easy to use and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.

Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only ONE MSigDB release for ONE resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.

This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.

Schema

Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.

[Figure: MSigDB schema diagram (Msigdb release.png)]

Note that wherever a table has an id primary key column, these primary key values are generated synthetically and should not be considered stable across different versions of MSigDB (likewise when used as a foreign key). In other words, the id of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.
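
To illustrate why a saved id should not be carried from one release to another, the sketch below builds two toy in-memory databases standing in for two MSigDB versions. The standard_name column and the set names are assumptions for the demo, not taken from the actual schema. The same gene set receives a different id in each toy release, so lookups are keyed on the name instead.

```python
import sqlite3

def make_release(names):
    """Build a toy one-table database; ids are assigned in insertion order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene_set (id INTEGER PRIMARY KEY, standard_name TEXT UNIQUE)"
    )
    conn.executemany(
        "INSERT INTO gene_set (standard_name) VALUES (?)", [(n,) for n in names]
    )
    return conn

def id_for(conn, name):
    """Look up the synthetic id for a gene set name within one database."""
    row = conn.execute(
        "SELECT id FROM gene_set WHERE standard_name = ?", (name,)
    ).fetchone()
    return row[0] if row else None

# Two hypothetical releases; the newer one gained a set, shifting the ids.
old_release = make_release(["SET_A", "SET_B"])
new_release = make_release(["SET_NEW", "SET_A", "SET_B"])

old_id = id_for(old_release, "SET_B")
new_id = id_for(new_release, "SET_B")

# Reusing the old id against the new release silently returns another set.
stale = new_release.execute(
    "SELECT standard_name FROM gene_set WHERE id = ?", (old_id,)
).fetchone()[0]

print(old_id, new_id, stale)
```

The id values remain perfectly usable for joins inside a single database; the point is only that they should be re-derived from a non-synthetic attribute whenever the database file changes.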

The core (blue) tables:

  • The gene_set table holds the core information about each gene set. Note that the collection_name and license_code columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.
    • The tags column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.
  • The gene_symbol table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. The namespace_id will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.
  • The gene_set_gene_symbol table joins each gene set to its member gene symbols.
  • In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, each gene set includes the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The source_member table contains these original gene set member identifiers (joined via gene_set_source_member).
    • The gene_symbol_id column gives the mapping to our uniformly mapped gene symbols. We provide a set of external CHIP files encoding the same information which will usually be more convenient to use, however.
    • Do not use these tables when extracting gene sets for custom gene set files for use with GSEA and other analysis tools: the source identifiers do not share a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.
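
The relationship between the core tables can be sketched as a three-way join from gene_set through gene_set_gene_symbol to gene_symbol. The example below runs against a toy in-memory database. The table names come from the description above, while the column names standard_name, symbol, NCBI_id, gene_set_id and gene_symbol_id, along with the demo rows, are assumptions that should be checked against the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema: table names follow the text above; column names are assumed.
conn.executescript("""
CREATE TABLE gene_set (id INTEGER PRIMARY KEY, standard_name TEXT,
                       collection_name TEXT, license_code TEXT);
CREATE TABLE gene_symbol (id INTEGER PRIMARY KEY, symbol TEXT,
                          NCBI_id TEXT, namespace_id INTEGER);
CREATE TABLE gene_set_gene_symbol (gene_set_id INTEGER, gene_symbol_id INTEGER);

INSERT INTO gene_set VALUES (1, 'DEMO_SET_1', 'DEMO', 'CC-BY-4.0'),
                            (2, 'DEMO_SET_2', 'DEMO', 'CC-BY-4.0');
INSERT INTO gene_symbol VALUES (1, 'TP53', '7157', 1), (2, 'BRCA1', '672', 1);
INSERT INTO gene_set_gene_symbol VALUES (1, 1), (1, 2), (2, 1);
""")

# Walk from each gene set through the link table to its canonical symbols.
query = """
SELECT gs.standard_name, sym.symbol
FROM gene_set AS gs
JOIN gene_set_gene_symbol AS link ON link.gene_set_id = gs.id
JOIN gene_symbol AS sym ON sym.id = link.gene_symbol_id
ORDER BY gs.standard_name, sym.symbol
"""

# Collect the members of each set into a plain dictionary.
members = {}
for set_name, symbol in conn.execute(query):
    members.setdefault(set_name, []).append(symbol)

print(members)
```

Because this join goes through gene_symbol rather than source_member, every returned member is in the single uniform namespace of the release.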

The metadata (purple) tables:

  • The gene_set_details table gives a variety of additional details for each gene set. It is essentially an extension of the core gene_set table, and uses the same primary key, but is kept separate in order to simplify the core table.
    Here are some columns of note:
    • While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the source_species_code column.
    • The external_details_URL column may actually contain multiple URLs. These will be separated by the pipe character ('|').
    • The exact_source column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free text pointing to, for example, a figure, section or supplementary document from a publication.
    • While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. The namespaces are recorded in the primary_namespace_id and secondary_namespace_id columns, and their count in num_namespaces. For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.
    • The added_in_MSigDB_id, changed_in_MSigDB_id, and changed_reason columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.
  • The collection table holds the information for each MSigDB Collection. For convenience, the collection_name column encodes the full collection hierarchy information, in the form "C5:GO:BP" or "M2:CP:REACTOME" for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.
  • The gene_set_license table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our License Terms page for more info.
  • The MSigDB table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release. While these older records are not currently referenced, they are included to cover the future intent to add revision history in the added_in_MSigDB_id and changed_in_MSigDB_id columns of the gene_set_details table as mentioned earlier.
  • The namespace and species tables allow us to label source_member and gene_symbol records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. Note again that the source identifier of a particular gene set member might come from a different species than the MSigDB target species.
  • The publication and author tables associate publication info with gene sets (joined by publication_author). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. Be sure to reference the publication itself for the most accurate authorship info.
    There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through "placeholder" publication records with titles like "Placeholder publication for M2872,M2873", where the identifiers at the end are the systematic_name(s) of the corresponding gene set.
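
Two of the metadata conventions above lend themselves to a short illustration: the colon-separated hierarchy in collection_name and the pipe-separated URLs in external_details_URL. The sketch below uses a toy in-memory database. The standard_name and gene_set_id column names, the example.org URLs, and the demo rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema: gene_set_details shares its key with gene_set, as described
# above; the exact column names used here are assumed.
conn.executescript("""
CREATE TABLE gene_set (id INTEGER PRIMARY KEY, standard_name TEXT,
                       collection_name TEXT);
CREATE TABLE gene_set_details (gene_set_id INTEGER PRIMARY KEY,
                               external_details_URL TEXT);

INSERT INTO gene_set VALUES (1, 'DEMO_GO_SET', 'C5:GO:BP'),
                            (2, 'DEMO_PATHWAY_SET', 'M2:CP:REACTOME');
INSERT INTO gene_set_details VALUES
    (1, 'https://example.org/a|https://example.org/b'),
    (2, 'https://example.org/c');
""")

# Select every set under the C5:GO branch by matching the encoded prefix.
rows = conn.execute("""
SELECT gs.standard_name, gs.collection_name, d.external_details_URL
FROM gene_set AS gs
JOIN gene_set_details AS d ON d.gene_set_id = gs.id
WHERE gs.collection_name LIKE 'C5:GO:%'
ORDER BY gs.standard_name
""").fetchall()

urls_by_set = {}
hierarchy_by_set = {}
for name, collection, url_field in rows:
    # The column may hold several URLs separated by '|'; tolerate NULL/empty.
    urls_by_set[name] = [u for u in (url_field or "").split("|") if u]
    # "C5:GO:BP" splits into its hierarchy levels.
    hierarchy_by_set[name] = collection.split(":")

print(urls_by_set, hierarchy_by_set)
```

Matching on the collection_name prefix avoids walking the recursive hierarchy in the collection table, which is the convenience the denormalized column is described as providing.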