MSigDB SQLite Database

From GeneSetEnrichmentAnalysisWiki
Revision as of 20:17, 23 March 2023 by Eby (talk | contribs)
Jump to navigation Jump to search

GSEA Home | Downloads | Molecular Signatures Database | Documentation | Contact

Introduction

With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.

Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.

Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.

The license terms for MSigDB are available here: http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp

Database Design

Design Considerations

The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.

Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only ONE MSigDB release for ONE resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.

This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.

Schema

Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.

Msigdb release.png

Note that in all cases of tables with an id primary key column, these primary key values are generated synthetically and will not be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the id of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.