RNA-Seq Data and Ensembl CHIP files
A GSEA analysis requires three different types of input data: a gene expression dataset in GCT format, the corresponding sample annotations in CLS format, and a collection of gene sets in GMT format. GSEA is typically used with gene sets from the Molecular Signatures Database (MSigDB), which consist of HUGO human gene symbols. However, gene expression data files may use other types of identifiers, depending on how the data were produced. To proceed with the analysis, GSEA converts the identifiers found in the data file to match the human symbols used in the gene set files. The conversion is performed using a CHIP file that provides the mapping between the two types of identifiers. Over the years, we have provided CHIP files for all major microarray platforms. For example, we have CHIP files that list the mappings between Affymetrix probe set IDs and human gene symbols.
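As a rough illustration of what a CHIP-based conversion amounts to, the sketch below parses a small tab-delimited mapping table and translates identifiers to gene symbols. The column names (`Probe Set ID`, `Gene Symbol`, `Gene Title`), the helper names `load_chip` and `to_symbols`, and the two sample rows are assumptions made for illustration. GSEA performs this step internally and may treat unmapped or duplicate identifiers differently.

```python
import csv
import io

# Illustrative CHIP-style table. The column layout is an assumption,
# and the two rows are examples only, not an excerpt from a real file.
CHIP_TEXT = (
    "Probe Set ID\tGene Symbol\tGene Title\n"
    "1053_at\tRFC2\treplication factor C subunit 2\n"
    "117_at\tHSPA6\theat shock protein family A (Hsp70) member 6\n"
)


def load_chip(text):
    """Return a dict mapping platform identifiers to gene symbols."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Probe Set ID"]: row["Gene Symbol"] for row in reader}


def to_symbols(ids, chip):
    """Translate identifiers to symbols, dropping any without a mapping."""
    return [chip[i] for i in ids if i in chip]


chip = load_chip(CHIP_TEXT)
print(to_symbols(["1053_at", "unknown_id", "117_at"], chip))
```

Identifiers absent from the table are silently dropped here, which is a simplification chosen for the sketch.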
In RNA-Seq, gene expression is quantified by counting the number of sequencing reads that aligned to a genomic range, according to a reference genome assembly or transcript annotations. The majority of tools use Ensembl reference annotations for this purpose. To facilitate GSEA analysis of RNA-Seq data, we now also provide CHIP files to convert human and mouse Ensembl IDs to HUGO gene symbols. Ensembl annotation uses a system of stable IDs that have prefixes based on the species name plus the feature type, followed by a series of digits and a version, e.g., ENSG00000139618.1.
The new GSEA Ensembl CHIP files provide mappings for human gene and transcript identifiers (i.e., Ensembl IDs with prefixes ENSG and ENST), and for mouse gene and transcript identifiers (i.e., Ensembl IDs with prefixes ENSMUSG and ENSMUST). Although transcript-level CHIP files are provided, we do not recommend performing GSEA on transcript-level quantification data, because GSEA lacks the functions necessary to properly sum transcript counts to gene level.
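The structure of a stable ID described above can be made concrete with a small parser. This is a minimal sketch: the `EnsemblId` type, the `parse_ensembl_id` function, and the regular expression are hypothetical names introduced here, and the pattern only recognizes gene (G) and transcript (T) identifiers, since those are the feature types the CHIP files cover.

```python
import re
from typing import NamedTuple, Optional


class EnsemblId(NamedTuple):
    species: str            # "" for human, "MUS" for mouse
    feature: str            # "G" = gene, "T" = transcript
    digits: str             # numeric part, kept as text to preserve zeros
    version: Optional[str]  # None when no version suffix is present

    @property
    def unversioned(self):
        """The identifier without its version suffix."""
        return f"ENS{self.species}{self.feature}{self.digits}"


# ENS + optional species code + feature type + digits + optional ".version"
_PATTERN = re.compile(r"^ENS([A-Z]*)([GT])(\d+)(?:\.(\d+))?$")


def parse_ensembl_id(text):
    """Split an Ensembl gene or transcript ID into its components."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an Ensembl gene/transcript ID: {text!r}")
    return EnsemblId(*match.groups())


parsed = parse_ensembl_id("ENSG00000139618.1")
print(parsed.feature, parsed.version, parsed.unversioned)
```

For the example ID from the text, the parser reports a gene-level identifier with version 1.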
To run GSEA with gene expression data specified with Ensembl identifiers:
- Prepare the GCT gene expression file such that identifiers are in the form of Ensembl IDs, but without the version suffix, e.g., ENSG00000139618 rather than ENSG00000139618.1.
- For RNA-Seq data, you will need to normalize the data, filter out low-count measurements, and perform other preprocessing as needed. Consult your local bioinformatician for help if unsure.
- Load the GCT and corresponding CLS files into GSEA.
- Choose gene sets to test - we usually recommend starting with the Hallmarks collection.
- Choose the CHIP file that matches the identifiers in the GCT file:
ENSEMBL_human_gene.chip => Ensembl ID prefix ENSG
ENSEMBL_human_transcript.chip => Ensembl ID prefix ENST
ENSEMBL_mouse_gene.chip => Ensembl ID prefix ENSMUSG
ENSEMBL_mouse_transcript.chip => Ensembl ID prefix ENSMUST
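The identifier-related steps above can be sketched in a few lines of Python: strip the version suffix, pick the CHIP file from the ID prefix, and write the values out as GCT text. The function names, the expression values, and the second ID's version number are made up for illustration. The GCT layout used (a `#1.2` line, a dimensions line, then a `NAME`/`Description` header) reflects our understanding of the GCT format and should be checked against the format documentation. Normalization and low-count filtering are not shown.

```python
# Prefix-to-CHIP table taken from the list above.
CHIP_BY_PREFIX = {
    "ENSG": "ENSEMBL_human_gene.chip",
    "ENST": "ENSEMBL_human_transcript.chip",
    "ENSMUSG": "ENSEMBL_mouse_gene.chip",
    "ENSMUST": "ENSEMBL_mouse_transcript.chip",
}


def strip_version(ensembl_id):
    """Drop the '.N' version suffix, if any."""
    return ensembl_id.split(".", 1)[0]


def prefix_of(ensembl_id):
    """Return the letters before the numeric part, e.g. 'ENSG'."""
    return strip_version(ensembl_id).rstrip("0123456789")


def choose_chip(ids):
    """Pick the CHIP file for a collection of IDs of a single type."""
    prefixes = {prefix_of(i) for i in ids}
    if len(prefixes) != 1:
        raise ValueError(f"expected one identifier type, found {sorted(prefixes)}")
    (prefix,) = prefixes
    if prefix not in CHIP_BY_PREFIX:
        raise ValueError(f"no Ensembl CHIP file listed for prefix {prefix!r}")
    return CHIP_BY_PREFIX[prefix]


def format_gct(expression, samples):
    """Render {id: [values, ...]} as GCT text with version-free IDs."""
    lines = [
        "#1.2",
        f"{len(expression)}\t{len(samples)}",
        "\t".join(["NAME", "Description", *samples]),
    ]
    for ensembl_id, values in expression.items():
        row = [strip_version(ensembl_id), "na", *(str(v) for v in values)]
        lines.append("\t".join(row))
    return "\n".join(lines) + "\n"


# Made-up, already normalized values for two samples.
expression = {
    "ENSG00000139618.1": [5.2, 7.1],
    "ENSG00000141510.2": [3.0, 2.4],
}
print(choose_chip(expression))
print(format_gct(expression, ["ctrl_1", "treated_1"]), end="")
```

The sketch does not check whether two versioned IDs collapse to the same unversioned ID; a real pipeline would need to decide how to handle that case.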
We have also added the gene-level Ensembl IDs to the website for use with the Investigate Gene Sets tools such as Compute Overlaps. As noted above, it is necessary to remove the version suffix from any supplied IDs. At this time, transcript-level identifiers are not accepted.