Using RNA-seq Datasets with GSEA
GSEA requires an expression dataset as input, containing expression profiles for multiple samples. The software supports several input file formats for these datasets, of which the tab-delimited GCT format is the most common. The first column of a GCT file contains feature identifiers (gene IDs or symbols for data derived from RNA-seq experiments). The second column contains a description of the feature; GSEA ignores this column, so it may be filled with “NA”. Each subsequent column contains the expression values for one sample.
There are no hard and fast rules for how a GCT file's expression values are derived. What matters is that the values are comparable across features within a sample and comparable across samples. RNA-seq quantification pipelines typically produce one or more of the following:
- Counts/Expected Counts
- Transcripts per Million (TPM)
- FPKM/RPKM
These quantifications are not normalized for comparisons across samples. GSEA depends on comparing a feature's expression levels across samples, so a normalization method that operates on raw count data (such as TMM or geometric mean) should be applied before running GSEA.
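To make the idea of count normalization concrete, the following is a minimal sketch of one geometric-mean-based scheme, often called median-of-ratios. It is an illustration only, not the implementation used by any particular package; the function names and the toy counts are hypothetical. For real analyses, an established tool should be used instead.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one scaling factor per sample from raw counts.

    counts: dict mapping gene -> list of raw counts, one per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    if not counts:
        raise ValueError("no genes supplied")
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for values in counts.values():
        if any(v <= 0 for v in values):
            continue
        # Log of the gene's geometric mean across samples.
        log_geo_mean = sum(math.log(v) for v in values) / n_samples
        for j, v in enumerate(values):
            log_ratios[j].append(math.log(v) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("every gene has a zero count in some sample")
    # The median ratio per sample is that sample's size factor.
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts):
    """Divide each sample's counts by its size factor."""
    factors = size_factors(counts)
    return {g: [v / f for v, f in zip(vals, factors)] for g, vals in counts.items()}


# Toy data (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
raw = {"GENE_A": [10, 20], "GENE_B": [100, 200], "GENE_C": [5, 10]}
normalized = normalize(raw)
```

In this toy case the second sample's size factor is twice the first, so the normalized values of each gene agree across the two samples.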
Note: ssGSEA (single-sample GSEA) projections perform substantially different mathematical operations from standard GSEA. For ssGSEA, gene-level summed TPM is an appropriate metric for analyzing RNA-seq quantifications.
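As a rough sketch of what "gene-level summed TPM" means, the snippet below adds up transcript-level TPM values for each gene. The function name, the transcript identifiers, and the transcript-to-gene mapping are hypothetical; in practice the mapping would come from the annotation used for quantification.

```python
from collections import defaultdict


def sum_tpm_to_gene(transcript_tpm, transcript_to_gene):
    """Sum transcript-level TPM values into gene-level TPM.

    transcript_tpm: dict mapping transcript ID -> TPM in one sample.
    transcript_to_gene: dict mapping transcript ID -> gene identifier.
    Transcripts without a gene assignment are skipped.
    """
    gene_tpm = defaultdict(float)
    for transcript, tpm in transcript_tpm.items():
        gene = transcript_to_gene.get(transcript)
        if gene is None:
            continue
        gene_tpm[gene] += tpm
    return dict(gene_tpm)


# Hypothetical identifiers, for illustration only.
tpm = {"tx1": 10.0, "tx2": 2.5, "tx3": 3.0, "tx_unmapped": 7.0}
tx2gene = {"tx1": "GENE_A", "tx2": "GENE_A", "tx3": "GENE_B"}
gene_level = sum_tpm_to_gene(tpm, tx2gene)
```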
Tools such as DESeq2 can produce properly normalized data (normalized counts) that are compatible with GSEA. The DESeq2 module available through the GenePattern environment produces a GSEA-compatible “normalized counts” table in the GCT format that can be used directly in the GSEA application.
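If normalized counts are produced outside GenePattern, they need to be written to a GCT file by hand. The sketch below assumes the GCT 1.2 layout (a "#1.2" version line, a line with the row and sample counts, then a header row starting with NAME and Description); the function name and the example values are hypothetical, and the layout should be checked against the GSEA file format documentation before use.

```python
import io


def write_gct(handle, sample_names, rows):
    """Write an expression table in an assumed GCT 1.2 layout.

    rows: iterable of (feature_id, description, values), where values
    holds one expression value per sample, in sample_names order.
    """
    rows = list(rows)
    for feature_id, _description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{feature_id}: expected {len(sample_names)} values")
    handle.write("#1.2\n")
    handle.write(f"{len(rows)}\t{len(sample_names)}\n")
    handle.write("\t".join(["NAME", "Description", *sample_names]) + "\n")
    for feature_id, description, values in rows:
        fields = [feature_id, description, *(str(v) for v in values)]
        handle.write("\t".join(fields) + "\n")


# Hypothetical normalized counts; the description column is filled with "NA".
buffer = io.StringIO()
write_gct(
    buffer,
    ["ctrl_1", "treat_1"],
    [("TP53", "NA", [812.4, 1033.0]), ("MYC", "NA", [95.0, 310.5])],
)
gct_text = buffer.getvalue()
```

Writing to a file instead of an in-memory buffer only requires passing an open text file handle in place of `buffer`.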
Note: While GSEA can accept transcript-level quantification directly and sum it to gene level, these quantifications are typically not properly normalized for between-sample comparisons. For this reason, the GSEA-MSigDB team no longer provides transcript-level CHIP annotations.
For more information on performing GSEA with RNA-seq data see: RNA-seq Data and Ensembl CHIP Files
The GSEA algorithm ranks the features listed in a GCT file and provides a number of alternative statistics for this ranking. In all cases (or at least in the cases where the dataset represents expression profiles for differing categorical phenotypes), the ranking statistic captures some measure of each gene's differential expression between a pair of categorical phenotypes. These metrics are widely used for RNA-seq datasets. However, they were originally selected for their effectiveness with microarray-based expression data, and the GSEA team has yet to fully evaluate whether they are entirely appropriate for data derived from RNA-seq experiments.
As an alternative to standard GSEA, analysis of data derived from RNA-seq experiments may also be conducted through the GSEAPreranked tool.
In particular:
- Prior to conducting gene set enrichment analysis, conduct your differential expression analysis using any of the tools developed by the bioinformatics community (e.g., cuffdiff, edgeR, or DESeq).
- Based on your differential expression analysis, rank your features and capture your ranking in an RNK-formatted file. The ranking metric can be whatever measure of differential expression you choose from the output of your selected DE tool. For example, cuffdiff provides the (base 2) log of the fold change.
- Run GSEAPreranked. If the exact magnitude of the rank metric is not directly biologically meaningful, select "classic" for your enrichment score, so that each gene's contribution to the enrichment score is not weighted by the value of its ranking metric.
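The second step above, capturing a ranking in an RNK file, can be sketched as follows. This assumes the RNK layout is two tab-delimited columns (feature, then ranking metric); the function name and the example scores are hypothetical. Non-finite scores, which some differential expression tools report for genes with zero counts in one condition, are dropped here as an illustrative choice rather than a recommendation.

```python
import io
import math


def write_rnk(handle, scores):
    """Write gene/score pairs as tab-delimited lines, highest score first.

    scores: dict mapping gene symbol -> ranking metric
    (for example a log2 fold change).
    Returns the gene symbols in the order written.
    """
    finite = {g: s for g, s in scores.items() if math.isfinite(s)}
    # Sorted for readability; the order is not assumed to matter to GSEAPreranked.
    ranked = sorted(finite.items(), key=lambda item: item[1], reverse=True)
    for gene, score in ranked:
        handle.write(f"{gene}\t{score}\n")
    return [gene for gene, _ in ranked]


# Hypothetical log2 fold changes from a differential expression tool.
buffer = io.StringIO()
order = write_rnk(
    buffer,
    {"GENE_A": 1.5, "GENE_B": -2.0, "GENE_C": float("inf"), "GENE_D": 0.25},
)
rnk_text = buffer.getvalue()
```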
Please note that if you choose to use any of the gene sets available from MSigDB in your analysis, the features listed in your RNK file must be genes, identified by their HUGO gene symbols. All gene symbols listed in the RNK file must be unique and must match the Ensembl version used in the targeted version of MSigDB. We also recommend that the values of the ranking metric be unique.
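A quick check for the uniqueness requirements above might look like the following sketch. The function name and example entries are hypothetical, and the check does not verify that symbols match a given Ensembl or MSigDB version.

```python
from collections import Counter


def check_rnk(entries):
    """Report duplicated gene symbols and tied ranking values.

    entries: list of (symbol, score) pairs as they would appear in an RNK file.
    Returns (duplicate_symbols, tied_scores), each sorted.
    """
    symbol_counts = Counter(symbol for symbol, _ in entries)
    score_counts = Counter(score for _, score in entries)
    duplicates = sorted(s for s, n in symbol_counts.items() if n > 1)
    ties = sorted(v for v, n in score_counts.items() if n > 1)
    return duplicates, ties


# Hypothetical entries: TP53 appears twice, and two genes share the score 1.4.
entries = [("TP53", 2.1), ("MYC", 1.4), ("TP53", 0.3), ("EGFR", 1.4)]
duplicates, ties = check_rnk(entries)
```

Duplicated symbols must be resolved before running GSEAPreranked; tied scores are only discouraged, so how to handle them is left to the analyst.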