Difference between revisions of "MSigDB SQLite Database"

From GeneSetEnrichmentAnalysisWiki
Jump to navigation Jump to search
Line 5: Line 5:
 
[http://www.broadinstitute.org/gsea/contact.jsp Contact]
 
[http://www.broadinstitute.org/gsea/contact.jsp Contact]
 
<br>
 
<br>
 +
 
<h2>Introduction</h2>
 
<h2>Introduction</h2>
 +
<p>
 
With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client.  No other downloads are necessary.  This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database.  Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc.  In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.
 
With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client.  No other downloads are necessary.  This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database.  Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc.  In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.
 
+
</p>
 +
<p>
 
Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.
 
Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.
 
+
</p>
 +
<p>
 
Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.
 
Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.
 
+
</p>
 +
<p>
 
The license terms for MSigDB are available here: http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp
 
The license terms for MSigDB are available here: http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp
 +
</p>
  
 
<h2>Database Design</h2>
 
<h2>Database Design</h2>
 
<h3>Design Considerations</h3>
 
<h3>Design Considerations</h3>
 +
<p>
 
The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.
 
The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.
 
+
</p>
 +
<p>
 
Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use.  Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking.  It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.
 
Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use.  Thus, each database file will hold only '''ONE''' MSigDB release for '''ONE''' resource, either Human or Mouse, with very little in the way of history tracking.  It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.
 
+
</p>
 +
<p>
 
This schema is designed to be a read-only resource.  After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers.  These simply do not apply other than during initial creation.  SQLite has no issues around multiple concurrent readers.
 
This schema is designed to be a read-only resource.  After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers.  These simply do not apply other than during initial creation.  SQLite has no issues around multiple concurrent readers.
 
+
</p>
<h2>Schema</h2>
+
<h3>Schema</h3>
 +
<p>
 
Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.
 
Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.
 +
</p>

Revision as of 20:10, 23 March 2023

GSEA Home | Downloads | Molecular Signatures Database | Documentation | Contact

Introduction

With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.

Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.

Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.

The license terms for MSigDB are available here: http://www.gsea-msigdb.org/gsea/msigdb_license_terms.jsp

Database Design

Design Considerations

The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.

Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only ONE MSigDB release for ONE resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.

This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.

Schema

Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.