Difference between revisions of "MSigDB SQLite Database"

From GeneSetEnrichmentAnalysisWiki
Jump to navigation Jump to search
Line 164: Line 164:
 
<pre>
 
<pre>
 
     SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs'
 
     SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs'
 +
</pre>
 +
</p>
 +
<p>
 +
This next query builds on our earlier example combined with the above to get all sets with either BRCA1 or BRCA2 as a member in that size range and save them to a GMT:
 +
<pre>
 +
    .mode tabs
 +
    .once BRCA1_BRCA2_sets.gmt
 +
    SELECT standard_name,
 +
          (SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs')
 +
              ||'/'||standard_name,
 +
          group_concat(symbol, '      ')
 +
    FROM gene_set gset
 +
      INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id
 +
      INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id
 +
    WHERE gset.id IN
 +
      ( SELECT distinct(gene_set_id)
 +
        FROM gene_set_gene_symbol gsgs2
 +
          INNER JOIN gene_symbol gsym2 ON gsym2.id = gsgs2.gene_symbol_id
 +
        WHERE symbol in ('BRCA1', 'BRCA2') )
 +
    GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500
 +
    ORDER BY standard_name ASC;
 
</pre>
 
</pre>
 
</p>
 
</p>

Revision as of 21:21, 23 March 2023

GSEA Home | Downloads | Molecular Signatures Database | Documentation | Contact

Introduction

With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.

Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.

Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.

The License Terms for MSigDB are available on our website.

Database Design

Design Considerations

The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.

Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only ONE MSigDB release for ONE resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.

This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.

Schema

Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.

Msigdb release.png

Note that in all cases of tables with an id primary key column, these primary key values are generated synthetically and will not be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the id of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.

The core (blue) tables:

  • The gene_set table holds the core information about each gene set. Note that the collection_name and license_code columns are denormalized for ease of use; these hold the name of the MSigDB collection and its license respectively.
    • The tags column is unused at present and reserved for future use. It may be removed in the future in favor of a more structured alternative for providing tag metadata.
  • The gene_symbol table holds the canonical information for the genes found in MSigDB gene sets, including both the official symbol (HUGO for Human MSigDB, MGI for Mouse) and the NCBI (formerly Entrez) Gene ID. The namespace_id will be constant across a given database as all symbols are mapped into the same namespace for a particular release of MSigDB.
  • The gene_set_gene_symbol table joins the gene sets to its member gene symbols.
  • In addition to the canonical gene symbols, which are in the same namespace across all gene sets in an MSigDB release, all gene sets include the gene identifiers of its members as specified by the original source of the gene set. This original source will commonly be a publication, for example, or some broader resource like Reactome or Gene Ontology. The source_member table contains these original gene set member identifiers (joined via gene_set_source_member).
    • The gene_symbol_id column gives the mapping to our uniformly mapped gene symbols. We provide a set of external CHIP files encoding the same information which will usually be more convenient to use, however.
    • These tables should not be used when using the database to extract gene sets for custom gene set files for use with GSEA and other analysis tools as the source identifiers will not have a uniform namespace, may conflict with one another, and may not even have a valid mapping in modern namespaces. These tables are meant for informational purposes only.


The metadata (purple) tables:

  • The gene_set_details table gives a variety of additional details for each gene set. It is essentially an extension of the core gene_set table - and uses the same primary key - but is kept separate in order to simplify the core table.
    Here are some columns of note:
    • While each database of MSigDB is targeted at a particular species (Human or Mouse), the members of a given gene set may have originated in a different species than the target. This is given in the source_species_code column.
    • The external_details_URL column may actually contain multiple URLs. These will be separated by the pipe character ('|').
    • The exact_source column holds information on finding the source of the gene set from wherever it originated. For external resources like Reactome or Gene Ontology this is frequently an identifier defined by the resource itself (e.g. R-HSA-156588) which can be used to look up further details on that resource's website. The column can also hold free-text listing e.g. a figure, section or supplementary document from a publication.
    • While we now require all new gene sets to consist of members from a single namespace, some older sets contain members from a mix of namespaces. These are found in the primary_namespace_id, secondary_namespace_id, and their count in num_namespaces. For the relatively few cases where there are more than two, any additional namespaces can be found by iterating through the linked source members.
    • The added_in_MSigDB_id, changed_in_MSigDB_id, and changed_reason columns are unused at present and reserved for future use. They are intended to hold MSigDB revision history.
  • The collection table holds the information for each MSigDB Collection. For convenience, the collection_name column encodes the full collection hierarchy information, in the form "C5:GO:BP" or "M2:CP:REACTOME" for example. There is also a fully recursive hierarchy encoded in the table but we expect few users to need this.
  • The gene_set_license table allows us to associate licensing info with each gene set. The vast majority are Creative Commons Attribution 4.0 International (CC-BY-4.0); see our License Terms page for more info.
  • The MSigDB table gives information about the database as a whole. It contains information about the date of release, the mapping information used (where available), the target species, etc. There are records covering all versions of MSigDB going back from the current version to the original 1.0 release. While these older records are not currently referenced, they are included to cover the future intent to add revision history in the added_in_MSigDB_id and changed_in_MSigDB_id columns of the gene_set_details table as mentioned earlier.
  • The namespace and species tables allow us to label source_member and gene_symbol records to identify the mapping info associated with each (that is, what kind of identifier or symbol we have), as well as the overall target species of MSigDB itself. Note again that the source identifier of a particular gene set member might differ from the MSigDB target species.
  • The publication and author tables associate publication info to gene sets (joined by publication_author). Where possible, we have extracted the author name info from PubMed based on the PubMed ID (PMID). This is imperfect, however, as there are cases of distinct authors with identical names. Our information here is only as good as PubMed allows it to be. Be sure to reference the publication itself for the most accurate authorship info.
    There are a few cases of gene sets with author info but without an associated publication in PubMed. These are represented through "placeholder" publication records with titles like "Placeholder publication for M2872,M2873", where the identifiers at the end are the systematic_name(s) of the corresponding gene set.


The "external item" (gray) tables:

  • When mining external resources for gene sets, e.g., Reactome, Gene Ontology, Human Phenotype Ontology, we sometimes find that the resulting collection would contain multiple gene sets that are too similar if we include them all. We apply a redundancy filtering procedure and select a single representative of similar candidate gene sets and exclude the others. MSigDB’s online gene set page of a selected gene set includes information about any related candidate gene sets that were excluded, linking out to details on the external resource’s website. The gray tables external_term and external_term_filtered_by_similarity contain this information.

Example Queries

The examples given here assume we are working with the MSigDB Human database from our Downloads page (msigdb_v2023.1.Hs.db is the current version at the time of this writing). Note that we ZIP the database to reduce its size, so you must decompress it first before use.

These examples also assume the use of the official SQLite command line shell to keep everything consistent across all platforms. The exact results may vary depending on the version of the database you are using and the particular query.

Extracting gene sets in the GMT format

One key use-case for performing SQL queries against the database involves building custom collections of gene sets, so those have been designed to be fast and convenient. For example, the following will select all the WikiPathways sets in the Human database into a GMT file named wikipathways.gmt:

    .mode tabs
    .once wikipathways.gmt
    SELECT standard_name 'na', group_concat(symbol, '       ')
    FROM gene_set gset
      INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id
      INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id
    WHERE collection_name = 'C2:CP:WIKIPATHWAYS'
    GROUP BY standard_name ORDER BY standard_name ASC;

The basic template for creating GMTs is as follows:

    .mode tabs
    .once <filename>
    SELECT standard_name 'na', group_concat(symbol, '       ')
    FROM gene_set gset
      INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id
      INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id
    WHERE <selection criteria>
    GROUP BY standard_name ORDER BY standard_name ASC;

Simply vary the criteria in the WHERE clause to determine the contents of the output GMT. The first two lines are SQLite specific directives (fill in the desired file name on line 2). Note that the second argument to the group_concat function is a quoted tab character.

Finding gene sets containing one or more specified genes

Here's another simple example that finds the names of all gene sets which have BRCA1 or BRCA2 as a member:

    SELECT distinct(standard_name)
    FROM gene_set gset
      INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id
      INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id
    WHERE symbol in ('BRCA1', 'BRCA2') ORDER BY standard_name;

    AAAYWAACM_HFH4_01
    ACTAYRNNNCCCR_UNKNOWN
    ACTGAAA_MIR30A3P_MIR30E3P
    ARID3B_TARGET_GENES
    ASH1L_TARGET_GENES
    <...etc...>

Extracting gene sets and their metadata

This query gets all the Reactome sets after applying a size threshold of between 15 and 500 genes. Here we are also providing a full link to the gene set on the GSEA-MSigDB website in place of the ‘na’ of the earlier example:

    .mode tabs
    .once wikipathways_threshold.gmt
    SELECT standard_name,
           ( SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs' )
               ||'/'||standard_name,
           group_concat(symbol, '       ')
    FROM gene_set gset
      INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id
      INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id
    WHERE collection_name = 'C2:CP:WIKIPATHWAYS'
    GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500
    ORDER BY standard_name ASC;

Note that here we are using a subquery to get the MSigDB_base_URL to build the website link:

    SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs'

This next query builds on our earlier example combined with the above to get all sets with either BRCA1 or BRCA2 as a member in that size range and save them to a GMT:

    .mode tabs
    .once BRCA1_BRCA2_sets.gmt
    SELECT standard_name,
           (SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs')
               ||'/'||standard_name,
           group_concat(symbol, '       ')
    FROM gene_set gset
      INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id
      INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id
    WHERE gset.id IN
      ( SELECT distinct(gene_set_id)
        FROM gene_set_gene_symbol gsgs2
          INNER JOIN gene_symbol gsym2 ON gsym2.id = gsgs2.gene_symbol_id
        WHERE symbol in ('BRCA1', 'BRCA2') )
    GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500
    ORDER BY standard_name ASC;