FAQ
Contents
- 1 GSEA Algorithm
- 1.1 What is the difference between GSEA and an overlap statistic (hypergeometric) analysis tool?
- 1.2 Why does GSEA use the Kolmogorov-Smirnov statistic rather than the Mann-Whitney test?
- 1.3 How does GSEA rank the genes in my dataset?
- 1.4 Can I use GSEA to analyze my own ranked list of genes?
- 1.5 Can I use GSEA to compare two datasets?
- 1.6 Can I use GSEA to analyze a dataset that contains a single sample?
- 1.7 Can I use GSEA to find pathways that correlate to the expression of my favorite gene?
- 1.8 How do I cite GSEA?
- 2 GSEA Data Files
- 2.1 How do I create an expression dataset file? What types of expression data can I analyze?
- 2.2 How do I filter or pre-process my dataset for GSEA?
- 2.3 How many samples do I need for GSEA?
- 2.4 How do I create a phenotype label file? What types of experiments can I analyze?
- 2.5 What gene sets are available? Can I create my own gene sets?
- 2.6 How many genes should there be in a gene set?
- 2.7 Can GSEA analyze a gene set that contains duplicate genes? duplicate gene sets?
- 2.8 Can GSEA analyze a gene set that contains genes that are not in my expression dataset?
- 2.9 How do I translate gene identifiers from one array platform to another?
- 2.10 What array platforms does GSEA support? What can I do if GSEA does not support my platform?
- 3 GSEA Results
- 3.1 Where are the GSEA statistics (ES, NES, FDR, FWER, nominal p value) described?
- 3.2 Why does GSEA use a false discovery rate (FDR) of 0.25 rather than the more classic 0.05?
- 3.3 Why does GSEA give me significant results with gene set (tag) permutation, but not with phenotype permutation?
- 3.4 What should I do if I have no significant gene sets or too many significant gene sets?
- 3.5 What does it mean for a gene set to have a nominal p value of zero?
- 3.6 What does it mean for a gene set to have a small nominal p value (p<0.025), but a high FDR value (FDR=1)?
- 3.7 What is the difference between the weighted statistic and the classic statistic? Which should I use?
- 3.8 Why are my results different from yours when I analyze the example datasets using GSEA?
- 3.9 Why do some of the example datasets contain negative expression values?
- 4 MSIGDB Gene Sets
- 4.1 What is MSigDB?
- 4.2 What is the difference between gene sets in MSigDB and GO/BioCarta/GenMAPP?
- 4.3 Does MSigDB include pathway diagrams?
- 4.4 Does MSigDB include GO gene sets?
- 4.5 Why do some MSigDB gene sets have the same gene represented multiple times?
- 4.6 How do I use your gene sets to analyze data from my favorite array platform?
- 4.7 How do I find out more information about a particular MSigDB gene set?
- 4.8 Can I use the MSigDB gene sets without using GSEA?
- 4.9 Is MSigDB available as a web service?
- 5 GSEA Software
- 5.1 How do I increase the amount of memory available to GSEA?
- 5.2 How do I run GSEA from the command line?
- 5.3 How do I add GSEA to my microarray analysis pipeline?
- 5.4 Do I have to be connected to the internet to run GSEA software?
- 5.5 What version of R do I need for GSEA software?
- 5.6 What version of java do I need for GSEA software?
- 5.7 Does GSEA have a programmatic API?
- 5.8 How do I create the input files for GSEA in R?
GSEA Algorithm
What is the difference between GSEA and an overlap statistic (hypergeometric) analysis tool?
An overlap statistic analysis tool typically uses a threshold to define genes as members at the top or bottom of a ranked list of genes. In contrast, GSEA uses the rank information from the list without applying a threshold. The introduction to the Gene Set Enrichment Analysis PNAS paper discusses the limitations of the former approach and how GSEA addresses them.
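To make the contrast concrete, here is a minimal Python sketch. It is not the GSEA implementation, and the function names are hypothetical. overlap_p_value applies a hypergeometric test to the genes above an arbitrary cutoff, while running_sum_es computes a simple unweighted running-sum score over the whole ranked list with no cutoff. The sketch assumes the gene set contains at least one gene from the list and does not cover the entire list.

```python
from math import comb

def overlap_p_value(ranked_genes, gene_set, cutoff):
    """Hypergeometric P(overlap >= observed) for the top `cutoff` genes."""
    total = len(ranked_genes)
    members = sum(1 for g in ranked_genes if g in gene_set)
    observed = sum(1 for g in ranked_genes[:cutoff] if g in gene_set)
    upper = min(members, cutoff)
    tail = sum(comb(members, k) * comb(total - members, cutoff - k)
               for k in range(observed, upper + 1))
    return tail / comb(total, cutoff)

def running_sum_es(ranked_genes, gene_set):
    """Unweighted running sum: step up on a hit, down on a miss,
    and return the largest deviation from zero."""
    hits = sum(1 for g in ranked_genes if g in gene_set)
    misses = len(ranked_genes) - hits
    running, best = 0.0, 0.0
    for g in ranked_genes:
        running += (1.0 / hits) if g in gene_set else (-1.0 / misses)
        if abs(running) > abs(best):
            best = running
    return best

# Toy data: ten ranked genes, with the gene set sitting at the top.
ranked = [f"g{i}" for i in range(10)]
gene_set = {"g0", "g1", "g2"}
print(overlap_p_value(ranked, gene_set, cutoff=3))
print(running_sum_es(ranked, gene_set))
```

The overlap result depends on the chosen cutoff, whereas the running sum uses the position of every gene in the list.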
Why does GSEA use the Kolmogorov-Smirnov statistic rather than the Mann-Whitney test?
The Kolmogorov-Smirnov statistic is slightly more suitable for less coherent data because it takes relatively fewer significant items to score well. The Gene Set Enrichment Analysis PNAS paper discusses the use of this statistic in detail (see the section titled Adjusting for Variation in Gene Set Size in the supplemental information).
How does GSEA rank the genes in my dataset?
By default, GSEA uses the signal-to-noise metric to rank the genes. Optionally, use the Metric for ranking genes parameter to select the ranking metric that you want GSEA to use. For more information, see the Metric for ranking genes parameter on the Run GSEA Page in the GSEA User Guide.
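As an illustration, the signal-to-noise metric is commonly defined as the difference of the class means divided by the sum of the class standard deviations. The Python sketch below follows that definition; the function names and toy data are hypothetical, the use of the sample standard deviation is an assumption, and GSEA itself may apply further adjustments to the standard deviation that are not reproduced here.

```python
from statistics import mean, stdev

def signal_to_noise(class_a, class_b):
    """(mean_a - mean_b) / (sd_a + sd_b) for one gene's expression values."""
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))

def rank_genes(expression, labels, class_a, class_b):
    """Score every gene and return (gene, score) pairs, highest score first."""
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == class_a]
        b = [v for v, lab in zip(values, labels) if lab == class_b]
        scores[gene] = signal_to_noise(a, b)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy dataset: six samples, three per phenotype.
labels = ["A", "A", "A", "B", "B", "B"]
expression = {
    "GENE1": [10, 12, 14, 4, 5, 6],
    "GENE2": [4, 5, 6, 10, 12, 14],
}
ranked_by_s2n = rank_genes(expression, labels, "A", "B")
print(ranked_by_s2n)
```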
Can I use GSEA to analyze my own ranked list of genes?
Yes. Use the GSEAPreranked analysis to run the gene set enrichment analysis against your own ranked list of genes. For more information, see the GSEAPreranked Page in the GSEA User Guide.
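If your scores are already in a script, a ranked list can be written out with a few lines of code. The Python sketch below assumes a two-column, tab-delimited layout (gene identifier, then score), which is our understanding of the ranked list file; the function name and scores are hypothetical, so check the exact file requirements in the GSEA User Guide.

```python
def format_rnk(scores):
    """Return tab-delimited 'gene<TAB>score' lines, highest score first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{gene}\t{score}\n" for gene, score in ordered)

# Hypothetical per-gene scores computed elsewhere.
example_scores = {"TP53": 2.5, "MYC": -1.0, "EGFR": 0.75}
rnk_text = format_rnk(example_scores)
print(rnk_text, end="")
```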
Can I use GSEA to compare two datasets?
Yes. Create a gene set that contains the top genes from the first dataset and use GSEA to analyze that gene set against the second dataset. Similarly, create a gene set that contains the top genes from the second dataset and use GSEA to analyze that gene set against the first dataset. For example, you might analyze the top 100 genes from each dataset.
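One way to build such a gene set is sketched below in Python. It takes per-gene scores from one dataset and emits a single line in the tab-delimited gmt layout (set name, description, then the genes). The function name, set name, and scores are hypothetical, and how you score the genes in the first place is up to you.

```python
def top_genes_gmt_line(set_name, description, scores, n=100):
    """Return one gmt line holding the n highest-scoring genes."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return "\t".join([set_name, description] + ordered[:n]) + "\n"

# Hypothetical scores from the first dataset.
scores_a = {"A": 3.0, "B": 1.0, "C": 2.0, "D": -1.0}
gmt_line = top_genes_gmt_line("DATASET_A_TOP2", "top genes from dataset A",
                              scores_a, n=2)
print(gmt_line, end="")
```

The resulting line can be saved to a gmt file and analyzed against the second dataset; repeating the procedure with the roles swapped gives the reverse comparison.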
Can I use GSEA to analyze a dataset that contains a single sample?
Yes. However, GSEA has no way of ranking the genes in such a dataset. Therefore, you must rank the genes yourself and then use GSEA to analyze the ranked list of genes. For more information, see the GSEAPreranked Page in the GSEA User Guide.
Can I use GSEA to find pathways that correlate to the expression of my favorite gene?
Yes, running GSEA with a continuous phenotype that defines the expression of your favorite gene finds gene sets that correlate to the expression of that gene (gene neighbors). You can have GSEA create the necessary phenotype for you: on the Run GSEA page, click the ... button next to the Phenotype labels parameter; when GSEA prompts you to select one or more phenotypes, click the Use a gene as the phenotype button to have GSEA create a continuous phenotype for your gene. For more information, see the Phenotype labels parameter on the Run GSEA Page in the GSEA User Guide.
How do I cite GSEA?
For information on how to cite the gene set enrichment analysis, GSEA software, and/or MSigDB, please see Gsea_Citation.
GSEA Data Files
How do I create an expression dataset file? What types of expression data can I analyze?
GSEA can be used with expression data from any source; for example, two-color ratio data, CEL files, different species, and so on. However, all expression data must first be converted into a supported ASCII tab-delimited file format, such as a res, gct, or pcl file. Other formats can usually be converted to one of these with simple modifications, such as reformatting the header or renaming columns, made in a standard text editor. For more information, see Preparing Data Files in the GSEA User Guide.
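For data that already lives in a script, the conversion can also be automated. The Python sketch below writes the gct layout as we understand it: a "#1.2" version line, a line with the row and column counts, a header line starting with NAME and Description, and then one tab-delimited row per gene. The function name and values are hypothetical; confirm the details in Preparing Data Files.

```python
def format_gct(sample_names, rows):
    """rows is a list of (name, description, values) tuples."""
    lines = [
        "#1.2",
        f"{len(rows)}\t{len(sample_names)}",
        "\t".join(["NAME", "Description"] + list(sample_names)),
    ]
    for name, description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{name}: expected {len(sample_names)} values")
        lines.append("\t".join([name, description] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"

gct_text = format_gct(["s1", "s2"], [("GENE1", "na", [1.5, 2.0])])
print(gct_text, end="")
```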
How do I filter or pre-process my dataset for GSEA?
While you generally do not want to filter your dataset based on expression values, you might want to minimize replicate samples and collapse probe sets. Collapsing the probe sets prevents multiple probes per gene from inflating the enrichment scores and facilitates the biological interpretation of analysis results. For more information, see Preparing Data Files in the GSEA User Guide.
How many samples do I need for GSEA?
This depends on your specific problem and data characteristics; however, as a rule of thumb, you typically want to analyze at least ten samples.
How do I create a phenotype label file? What types of experiments can I analyze?
GSEA can be used to analyze experiments of any type (including time-series, three or more classes, and so on). The phenotype labels (cls) ASCII file defines the experimental phenotypes and associates each sample in your dataset with one of those phenotypes. The cls file is an ASCII tab-delimited file, which you can easily create using a text editor. For more information, see Preparing Data Files in the GSEA User Guide.
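As an example for the categorical case, the Python sketch below builds a cls file from a list of per-sample labels. It assumes the three-line layout as we understand it: a line with the number of samples, the number of classes, and the constant 1; a line starting with # that names the classes; and a line assigning each sample to a class (here by index, in order of first appearance). The function name and labels are hypothetical, and continuous phenotypes use a different layout that is not shown.

```python
def format_cls(labels):
    """Build a categorical cls file body from one label per sample."""
    classes = list(dict.fromkeys(labels))  # unique labels, first-seen order
    index = {name: str(i) for i, name in enumerate(classes)}
    header = f"{len(labels)}\t{len(classes)}\t1"
    names = "\t".join(["#"] + classes)
    assignments = "\t".join(index[lab] for lab in labels)
    return "\n".join([header, names, assignments]) + "\n"

cls_text = format_cls(["ALL", "ALL", "AML"])
print(cls_text, end="")
```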
What gene sets are available? Can I create my own gene sets?
You can export gene sets from the Molecular Signature Database (MSigDB) or create your own. For more information, see Preparing Data Files in the GSEA User Guide.
How many genes should there be in a gene set?
GSEA automatically adjusts the enrichment statistics to account for different gene set sizes, as described in the Supplemental Information for the Gene Set Enrichment Analysis PNAS paper.
Can GSEA analyze a gene set that contains duplicate genes? duplicate gene sets?
Duplicate genes in a gene set and duplicate gene sets both affect GSEA results. GSEA automatically removes duplicate genes from each gene set, but does not check for duplicate gene sets. For more information, see Gene Sets in the GSEA User Guide.
Can GSEA analyze a gene set that contains genes that are not in my expression dataset?
The gene set enrichment analysis automatically restricts the gene sets to the genes in the expression dataset. The analysis report lists the gene sets and genes that were included and excluded from the analysis.
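If you want to preview which genes would be kept before running an analysis, the idea can be sketched in Python as follows. This is an illustration of the filtering concept only, not GSEA's code; the function name and gene names are hypothetical. It also drops repeated genes within the set.

```python
def restrict_gene_set(gene_set, dataset_genes):
    """Return (kept, dropped): unique genes found in / missing from the dataset."""
    dataset = set(dataset_genes)
    unique = list(dict.fromkeys(gene_set))  # drop duplicates, keep order
    kept = [g for g in unique if g in dataset]
    dropped = [g for g in unique if g not in dataset]
    return kept, dropped

kept, dropped = restrict_gene_set(["A", "B", "A", "X"], ["A", "B", "C"])
print(kept, dropped)
```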
How do I translate gene identifiers from one array platform to another?
GSEA provides a utility, Chip2Chip, which translates gene identifiers from one platform to another (for example, from Affymetrix to Agilent). For more information, see Chip2Chip in the GSEA User Guide.
What array platforms does GSEA support? What can I do if GSEA does not support my platform?
See DNA Chip (Array) Annotations in the GSEA User Guide.
GSEA Results
Where are the GSEA statistics (ES, NES, FDR, FWER, nominal p value) described?
For brief descriptions of the statistics that appear in the GSEA analysis report, see Interpreting GSEA in the GSEA User Guide. The Gene Set Enrichment Analysis PNAS paper also describes each of these statistics: for FDR and nominal p value, see the section titled Appendix: Mathematical Description of Methods; for FWER, see the section titled FWER in the supplemental information.
Why does GSEA use a false discovery rate (FDR) of 0.25 rather than the more classic 0.05?
An FDR of 25% indicates that the result is likely to be valid 3 out of 4 times, which is reasonable in the setting of exploratory discovery, where one is interested in finding candidate hypotheses to be further validated by future research. Given the lack of coherence in most expression datasets and the relatively small number of gene sets being analyzed, using a more stringent FDR cutoff may lead you to overlook potentially significant results. For more information about gene set enrichment analysis results, see Interpreting GSEA in the GSEA User Guide.
Why does GSEA give me significant results with gene set (tag) permutation, but not with phenotype permutation?
Phenotype permutation generally provides a more stringent assessment of significance and produces fewer false positives. Which permutation type you should use depends on the number of samples that you are analyzing. For more information, see the description of the Permutation type parameter on the Run GSEA Page in the GSEA User Guide.
What should I do if I have no significant gene sets or too many significant gene sets?
See Interpreting GSEA in the GSEA User Guide.
What does it mean for a gene set to have a nominal p value of zero?
A reported p value of zero (0.0) indicates an actual p value of less than 1/number-of-permutations. For a more accurate p value, increase the number of permutations performed by the analysis. For more information about gene set enrichment analysis results, see Interpreting GSEA in the GSEA User Guide.
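The reason is easiest to see in a simplified Python sketch of a permutation-based p value. The function name and scores are hypothetical, and the sketch ignores details of the real computation, such as how positive and negative enrichment scores are handled. The p value is the fraction of permutation scores that reach the observed score, so it can only take values in steps of 1/number-of-permutations.

```python
def nominal_p_value(observed, null_scores):
    """Fraction of permutation scores at least as large as the observed score."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return exceed / len(null_scores)

# Hypothetical scores from four permutations.
null_scores = [0.10, 0.25, 0.30, 0.40]
p_zero = nominal_p_value(0.50, null_scores)   # no permutation reaches 0.50
p_half = nominal_p_value(0.30, null_scores)   # two of four permutations do
resolution = 1 / len(null_scores)             # smallest nonzero p value
print(p_zero, p_half, resolution)
```

With only four permutations, a reported zero means nothing more precise than "less than 0.25"; more permutations shrink that step.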
What does it mean for a gene set to have a small nominal p value (p<0.025), but a high FDR value (FDR=1)?
The nominal p value estimates the significance of the observed enrichment score for a single gene set. However, when you are evaluating multiple gene sets, you must correct for multiple hypothesis testing. The FDR is the estimated probability that a gene set with a given enrichment score (normalized for gene set size) represents a false positive finding.
Generally, when your top gene sets have small nominal p values and high FDRs, it is because they are not as significant when compared with other gene sets in the empirical null distribution. This could be because you do not have enough samples, the biological signal is subtle, or the gene sets do not represent the biology in question very well. Also, the FDR is based on all gene sets; if only one of many gene sets is enriched, that gene set is likely to have a high FDR.
For more information, see Interpreting GSEA in the GSEA User Guide.
What is the difference between the weighted statistic and the classic statistic? Which should I use?
See the description of the Enrichment statistic parameter on the Run GSEA Page in the GSEA User Guide.
Why are my results different from yours when I analyze the example datasets using GSEA?
You are using a different random number generator (for sample permutation) and different seeds for that random number generator, so the resulting numbers are different.
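The effect of the seed can be illustrated with a short Python sketch. Python's random module is used here purely as a stand-in and is not the generator that GSEA uses; the function name and labels are hypothetical.

```python
import random

def permute_labels(labels, seed):
    """Return a shuffled copy of the labels using a seeded generator."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

sample_labels = ["A"] * 5 + ["B"] * 5
same_seed_1 = permute_labels(sample_labels, seed=42)
same_seed_2 = permute_labels(sample_labels, seed=42)
other_seed = permute_labels(sample_labels, seed=7)
print(same_seed_1 == same_seed_2)  # the same seed reproduces the permutation
print(other_seed)
```

Repeating a run with the same generator and seed reproduces the same permutations; changing either one generally yields different permutations and therefore slightly different statistics.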
Why do some of the example datasets contain negative expression values?
In older releases of Affymetrix gene chips, expression values were trimmed averages over match and mismatch probes. If the mismatch probe values were higher, the result was a negative number.
MSIGDB Gene Sets
What is MSigDB?
MSigDB, the Molecular Signature Database, contains curated gene sets for use with the gene set enrichment analysis. The GSEA team has begun the critical work of populating MSigDB with curated gene sets. Increasing the number of gene sets increases the value of this resource; therefore, the GSEA team appreciates gene set contributions and encourages users to submit their gene sets to gsea@broad.mit.edu. For information about exporting gene sets from MSigDB, see Gene Sets in the GSEA User Guide.
What is the difference between gene sets in MSigDB and GO/BioCarta/GenMAPP?
MSigDB contains gene sets formatted for use with GSEA. MSigDB places emphasis on a genomic, unbiased approach to the definition of gene sets; therefore, an important component of MSigDB is the collection of gene sets from published expression profiles. Unlike gene sets curated from prior knowledge (such as GO, BioCarta, and so on), experimental sets provide an unbiased readout of a biological state; experimental sets from microarray experiments reflect purely transcriptional events.
Does MSigDB include pathway diagrams?
No.
Does MSigDB include GO gene sets?
The C2 (curated) category of gene sets contains a subcategory called Ontologies, which contains the GO gene sets. For a complete description of these gene sets, see the Gene Set Cards.
Why do some MSigDB gene sets have the same gene represented multiple times?
The gene sets reflect the information in the original source; no attempt is made to modify the definition of a gene set, except to eliminate obvious gene duplications. The gene sets defined in terms of gene symbols eliminate the duplication produced by multiple probes representing the same gene.
How do I use your gene sets to analyze data from my favorite array platform?
GSEA provides a utility, Chip2Chip, which translates gene identifiers from one platform to another. For more information, see Chip2Chip in the GSEA User Guide.
How do I find out more information about a particular MSigDB gene set?
Each gene set is described by a Gene Set Card on the GSEA web site.
Can I use the MSigDB gene sets without using GSEA?
Yes. You can download the gene sets from the MSigDB page or access them programmatically by connecting anonymously to the Broad ftp server: ftp.broad.mit.edu://pub/cancer/gsea/gene_sets/all.may_2006.symbols.gmt.
Is MSigDB available as a web service?
Not at this time.
GSEA Software
How do I increase the amount of memory available to GSEA?
On Windows, if you launch GSEA via the desktop icon, edit the following line in the gsea_home/gsea.lax file:
lax.nl.java.option.java.heap.size.max=512m
If launching via a direct call to the jar file, add a -Xmx flag (run the command from the folder that contains the gsea2.jar file or specify the full path to the .jar file):
java -Xmx512m -jar gsea2.jar
If increasing memory, you might try doubling the default value to 1024m. On Windows, the maximum appears to be 2048m; on 32-bit Linux, 1800m.
How do I run GSEA from the command line?
GSEA can be run from the command line. For instructions, see Running GSEA from the Command Line in the GSEA User Guide.
How do I add GSEA to my microarray analysis pipeline?
If you are using GenePattern pipelines (http://www.broad.mit.edu/cancer/software/GenePattern/), GSEA is available as a GenePattern analysis module.
If you are implementing your own microarray analysis pipeline, GSEA can be run from the command line. Use full file specifications and the -Dhome parameter to ensure that you are reading data from and writing data to the desired locations. For more information, see Running GSEA from the Command Line in the GSEA User Guide.
Do I have to be connected to the internet to run GSEA software?
You can use most functions in GSEA without being connected to the internet; for example, you can load files, run analyses, and review analysis results; however, you must be connected to the internet to access the GSEA web site (for example, to download gene sets or read the documentation). If you are working offline, clear the menu item Option>Connect over the internet for gene sets. By default, this item is selected and the Gene sets database parameter (on pages such as Run GSEA) displays gene sets available from the GSEA web site; clearing the menu item disables this feature and avoids the time-consuming attempts to connect to the internet.
What version of R do I need for GSEA software?
Version 1.9 or later.
What version of java do I need for GSEA software?
Java 1.4 or later. Java 1.5 is recommended. If you do not have the correct version of Java installed when you start GSEA, an error message appears referring to an "unsupported class version."
For information about Java on Mac OS X, see http://docs.info.apple.com/article.html?artnum=302412.
Does GSEA have a programmatic API?
Yes. R and Java APIs are available on the GSEA web site.
How do I create the input files for GSEA in R?
The GSEA R code uses the gct, cls and gmt file formats for input. For more information, see Preparing Data Files in the GSEA User Guide.