FAQ

From GeneSetEnrichmentAnalysisWiki
Revision as of 14:15, 16 October 2012 by Liberzon (talk | contribs)

GSEA Algorithm

What is the difference between GSEA and an overlap statistic (hypergeometric) analysis tool?

An overlap statistic analysis tool typically uses a threshold to define which genes count as members of the top or bottom of a ranked list of genes. In contrast, GSEA uses the rank information from the entire list without applying a threshold. The introduction to the GSEA 2005 PNAS paper discusses the limitations of the former approach and how GSEA addresses them.
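To make the contrast concrete, here is a minimal sketch of the overlap approach (not of GSEA itself). It assumes a cutoff has already selected some number of top genes and computes the hypergeometric tail probability of the observed overlap with a gene set. The function name and the example numbers are hypothetical.

```python
from math import comb

def hypergeom_overlap_pvalue(n_universe, n_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_universe genes, of which n_set belong to the gene set."""
    total = comb(n_universe, n_selected)
    upper = min(n_set, n_selected)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 10,000 genes, a 50-gene set, the top 100 genes
# selected by a threshold, 5 of which fall in the set.
p = hypergeom_overlap_pvalue(10_000, 50, 100, 5)
print(f"{p:.3g}")
```

Note that the result depends on where the threshold is placed (the value 100 above), which is the limitation the threshold-free approach avoids.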

Why does GSEA use the Kolmogorov-Smirnov statistic rather than the Mann-Whitney test?

The Kolmogorov-Smirnov statistic is slightly more suitable for less coherent data because it takes relatively fewer significant items to score well. The GSEA 2005 PNAS paper discusses the use of this statistic in detail (see the section titled Adjusting for Variation in Gene Set Size in the supplemental information).
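As an illustration only, the following sketch computes an unweighted, Kolmogorov-Smirnov-style running-sum score over a ranked list. It is not the GSEA implementation: it omits the weighting by the ranking metric and all normalization, and it assumes the ranked list contains no duplicate genes. The function name is hypothetical.

```python
def classic_enrichment_score(ranked_genes, gene_set):
    """Maximum deviation from zero of an unweighted running sum.

    Walking down the ranked list, the sum steps up at each gene in the
    set and down at each gene outside it; steps are scaled so the sum
    returns to zero at the end of the list.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hits = len(members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hits if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_score = classic_enrichment_score(ranked, {"g1", "g2"})     # set at the top
bottom_score = classic_enrichment_score(ranked, {"g5", "g6"})  # set at the bottom
print(top_score, bottom_score)
```

A set concentrated at the top of the list gives a positive score and a set concentrated at the bottom gives a negative one.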

How does GSEA rank the genes in my dataset?

By default, GSEA uses the signal-to-noise metric to rank the genes. Optionally, use the Metric for ranking genes parameter to select the ranking metric that you want GSEA to use.  For more information, see the Metric for ranking genes parameter on the Run GSEA Page in the GSEA User Guide.
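For orientation, here is a minimal sketch of ranking genes by a signal-to-noise ratio, taken here as the difference of class means divided by the sum of class standard deviations. The use of the sample standard deviation is an assumption of this sketch, and any adjustments that the GSEA software applies to the standard deviations are omitted; see the User Guide for the exact definition. The gene names and values are hypothetical.

```python
from statistics import mean, stdev

def signal_to_noise(values_a, values_b):
    """(mean_A - mean_B) / (sd_A + sd_B) for one gene."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))

# Hypothetical data: gene -> (class A values, class B values)
expression = {
    "GENE1": ([10.0, 12.0, 11.0], [5.0, 6.0, 4.0]),
    "GENE2": ([3.0, 4.0, 5.0], [8.0, 9.0, 10.0]),
    "GENE3": ([7.0, 8.0, 9.0], [7.5, 8.5, 9.5]),
}
scores = {gene: signal_to_noise(a, b) for gene, (a, b) in expression.items()}

# Highest score first: genes most up-regulated in class A lead the list.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```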

Can I use GSEA to analyze my own ranked list of genes?

Yes. Use the GSEAPreranked analysis to run the gene set enrichment analysis against your own ranked list of genes. For more information, see the GSEAPreranked Page in the GSEA User Guide.

Can I use GSEA to compare two datasets?

Yes. Create a gene set that contains the top genes from the first dataset and use GSEA to analyze that gene set against the second dataset. Similarly, create a gene set that contains the top genes from the second dataset and use GSEA to analyze that gene set against the first dataset. For example, you might analyze the top 100 genes from each dataset.
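One way to build such a gene set is sketched below. It assumes you already have a score per gene for the first dataset and that the gene set is written as a single GMT line (set name, description, then the genes, all tab-separated); check the layout against Data Formats. The function name, set name, and scores are hypothetical.

```python
def top_genes_gmt_line(set_name, scores, n_top, description="na"):
    """Return one GMT line holding the n_top highest-scoring genes."""
    top = sorted(scores, key=scores.get, reverse=True)[:n_top]
    return "\t".join([set_name, description, *top])

# Hypothetical scores from the first dataset.
dataset_a_scores = {"TP53": 3.1, "MYC": 2.4, "EGFR": -0.5, "KRAS": 1.9}
line = top_genes_gmt_line("DATASET_A_TOP3", dataset_a_scores, 3)
print(line)
```

The resulting line can be saved to a .gmt file and analyzed against the second dataset; repeat with the roles of the datasets swapped.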

Can I use GSEA to analyze a dataset that contains a single sample?

Yes.  However, GSEA has no way of ranking the genes in such a dataset. Therefore, you must rank the genes and then use GSEA to analyze the ranked list of genes. For more information, see the GSEA Preranked Page in the GSEA User Guide.

Can I use GSEA to analyze paired samples?

No. GSEA software does not provide paired-sample analysis. If you create a ranked list of genes by running a paired-sample marker analysis outside of GSEA, you can use GSEA to analyze that ranked list of genes. For more information about analyzing your own ranked lists of genes, see the GSEA Preranked Page in the GSEA User Guide.

Can I use GSEA to analyze time series data?

Yes. The phenotype labels (.cls) file defines the experimental phenotypes and associates each sample in your dataset with one of those phenotypes. To analyze time course data, use a continuous phenotype label. For more information, see Phenotype Labels in the GSEA User Guide. When you run the GSEA analysis, select Pearson in the Metric for ranking genes parameter. This is the only metric that can be used with time series data. For more information about the metrics used for ranking genes, see Metrics for Ranking Genes in the GSEA User Guide.
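To illustrate what ranking by Pearson correlation against a continuous phenotype means, here is a minimal sketch that correlates each gene's profile with a numeric profile such as time points. It is a plain textbook Pearson correlation, not GSEA code; the gene names and values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

time_points = [0.0, 1.0, 2.0, 4.0]  # the continuous phenotype
expression = {                       # hypothetical gene profiles
    "GENE_UP": [1.0, 2.0, 3.0, 5.0],
    "GENE_DOWN": [9.0, 8.0, 7.0, 5.0],
    "GENE_MIXED": [2.0, 5.0, 1.0, 4.0],
}
correlations = {g: pearson(time_points, v) for g, v in expression.items()}
ranked = sorted(correlations, key=correlations.get, reverse=True)
print(ranked)
```

The same calculation applies when the continuous profile is the expression of a single gene rather than a time axis.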

Can I use GSEA to find pathways that correlate to the expression of my favorite gene?

Yes. In your phenotype file, create a continuous phenotype where the expression profile is that of your favorite gene.
You can have GSEA create the necessary phenotype for you: on the Run GSEA page, click the ... button next to the Phenotype labels parameter; when GSEA prompts you to select a phenotype, click the Use a gene as the phenotype button to have GSEA create a continuous phenotype for your gene. For more information, see the Phenotype labels parameter on the Run GSEA Page in the GSEA User Guide.

Can I use GSEA with gene sets that have both up- and down-regulated genes?

The GSEA software does not yet support this, but you can use the enrichment statistic with gene sets that include both up- and down-regulated genes. For one approach, see Lamb et al. 2006.

How do I cite GSEA?

To cite GSEA, please reference Subramanian, Tamayo, et al. 2005 Proc Natl Acad Sci U S A 102(43):15545-50.

To cite your use of the Molecular Signatures Database (MSigDB), please reference Liberzon et al. 2011 Bioinformatics 27(12):1739-40 and also the source for the gene set as listed on the gene set page.

Can I use GSEA to analyze SNP, SAGE, ChIP-Seq or RNA-Seq data?

We are happy to announce that GenePattern now offers a suite of tools to support a wide variety of RNA-seq analyses, including short-read mapping, identification of splice junctions, transcript and isoform detection, quantitation, and differential expression. In particular, GenePattern has developed the ExprToGct module to convert RNA-Seq data to the gene expression matrix in the GCT format that can serve as an input for GSEA. For further details, please contact the GenePattern team.

Alternatively, you can analyze such data with some extra work. Derive a ranked list of unique human gene symbols from your SNP, SAGE, ChIP-Seq or RNA-Seq data, and then analyze the ranked list using the GSEAPreranked tool.
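A minimal sketch of preparing such a list follows. It assumes the ranked list is saved as two tab-separated columns (gene symbol, score), and it resolves duplicate symbols by keeping the score with the largest absolute value; that rule is only one possible choice and is an assumption of this sketch, as are the function name and the example scores.

```python
def to_rnk_text(scored_genes):
    """Build ranked-list text from (symbol, score) pairs.

    Duplicate symbols are reduced to the score with the largest
    absolute value (an illustrative choice, not a GSEA rule).
    """
    best = {}
    for symbol, score in scored_genes:
        if symbol not in best or abs(score) > abs(best[symbol]):
            best[symbol] = score
    ordered = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{symbol}\t{score}\n" for symbol, score in ordered)

rnk = to_rnk_text([("TP53", 2.5), ("MYC", -1.0), ("TP53", -3.0), ("EGFR", 0.7)])
print(rnk, end="")
```

Write the returned text to a file with the .rnk extension before loading it.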

GSEA Data Files

How do I create an expression dataset file? What types of expression data can I analyze?

GSEA requires that expression data be in a RES, GCT, PCL, or TXT file. All four file formats are tab-delimited text files. For details of each file format, see Data Formats.

GenePattern provides several modules for converting expression data into gct and/or res files:

  • ExpressionFileCreator converts raw expression data from Affymetrix CEL files.

  • GEOImporter and caArrayImportViewer create a GCT file based on expression data extracted from the GEO or caArray microarray expression data repository, respectively.

  • MAGEImportViewer module converts MAGE-ML format data. MAGE-ML is the standard format for storing both Affymetrix and cDNA microarray data at the ArrayExpress repository.

To use expression data stored in any other format (such as cDNA microarray data), first convert the data into a tab-delimited text file that contains expression measurements with genes as rows and samples as columns. Then modify that text file to comply with the gct file format requirements, as described in Expression Datasets in the GSEA User Guide.
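The conversion can be sketched as follows. The sketch assumes the gct layout is a version line (#1.2), a line with the number of rows and the number of samples, a header line beginning with NAME and Description, and then one tab-delimited row per gene; confirm this against Data Formats before relying on it. The function name and the data are hypothetical.

```python
def to_gct_text(sample_names, rows):
    """rows: list of (name, description, [one value per sample])."""
    lines = [
        "#1.2",
        f"{len(rows)}\t{len(sample_names)}",
        "\t".join(["NAME", "Description", *sample_names]),
    ]
    for name, description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{name}: expected {len(sample_names)} values")
        lines.append("\t".join([name, description, *map(str, values)]))
    return "\n".join(lines) + "\n"

gct = to_gct_text(
    ["S1", "S2"],
    [("GENE1", "na", [1.5, 2.0]), ("GENE2", "na", [0.3, 4.1])],
)
print(gct, end="")
```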

If you are using two-color ratio data, see also cDNA Microarray Data.

Parsing Errors: If you see the following parsing error when you load your data file, check the file extension:
There were errors: ERRORS #:1Parsing trouble…


The file extension of the expression dataset file identifies the format of the file. If a gct, res, or pcl file has a .txt file extension, you will see the parsing error when you load the file into GSEA. Check that the file extension matches the file format. Note that some operating systems (such as Windows) can be configured to hide known file extensions. If your operating system is configured to hide known extensions, a file named test.gct.txt will be listed as test.gct. Look at the file type of the file: it should be GCT (or RES or PCL), not Text Document.

How do I filter or pre-process my dataset for GSEA?

How you filter or pre-process your data depends on your study. Here are a few guidelines to consider:

  • Probe identifiers versus gene identifiers. Typically, your dataset contains the probe identifiers native to your microarray platform DNA chip. GSEA can analyze the probe identifiers or collapse each probe set to a gene vector, where the gene is identified by gene symbol. Collapsing the probe sets prevents multiple probes per gene from inflating the enrichment scores and facilitates the biological interpretation of analysis results.
  • AP call filters. You can run GSEA on filtered or unfiltered data. Typically, the GSEA team runs the analysis on unfiltered data. One suggested approach is to run GSEA on the unfiltered data first. If the results seem dominated by gene sets with poorly expressed genes, you might gain insight into what thresholds to use for the call filters.
  • Expression values. The GSEA algorithm examines the differences in expression values rather than the values themselves. For example, you might have natural scale data or logged expression levels; you might have Affymetrix data or two-color ratio data. As in most data analysis methodologies, the same expression data represented in different formats may generate different analysis results. The differences are expected. GSEA cannot determine which results are "correct."
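To illustrate the first guideline, here is a minimal sketch of collapsing probe-level rows to gene-level rows. It uses the per-sample maximum across a gene's probes, which is one possible collapsing rule and an assumption of this sketch; the rule that GSEA applies is controlled by its own parameters. Probe IDs, gene names, and values are hypothetical.

```python
def collapse_to_genes(probe_values, probe_to_gene):
    """Collapse probe rows to gene rows using the per-sample maximum.

    Probes with no gene mapping are dropped.
    """
    collapsed = {}
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        if gene in collapsed:
            collapsed[gene] = [max(a, b) for a, b in zip(collapsed[gene], values)]
        else:
            collapsed[gene] = list(values)
    return collapsed

probe_values = {
    "p1": [1.0, 5.0],
    "p2": [3.0, 2.0],
    "p3": [4.0, 4.0],
    "p4": [9.0, 9.0],  # has no mapping below, so it is dropped
}
probe_to_gene = {"p1": "GENEA", "p2": "GENEA", "p3": "GENEB"}
genes = collapse_to_genes(probe_values, probe_to_gene)
print(genes)
```

After collapsing, each gene contributes exactly one row, so a gene measured by many probes cannot inflate a gene set's score.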

For more information, see Preparing Data Files in the GSEA User Guide.

Should I use natural or log scale data for GSEA?

We recommend using natural scale data. We used it when we calibrated the GSEA method and it seems to work well in general cases.

Traditional modeling techniques, such as clustering, often benefit from data preprocessing. For example, one might filter expression data to remove genes that have low variance across the dataset and/or log transform the data to make the distribution more symmetric. The GSEA algorithm does not benefit from such preprocessing of the data.

How many samples do I need for GSEA?

This depends on your specific problem and data characteristics; however, as a rule of thumb, you typically want to analyze at least ten samples.

If you have technical replicates, you generally want to remove them by averaging or some other data reduction technique. For example, assume you have five tumor samples and five control samples each run three times (three replicate columns) for a total of 30 data columns. You would average the three replicate columns for each sample and create a dataset containing 10 data columns (five tumor and five control).
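The averaging in this example can be sketched as below for a single gene row. The sketch assumes the replicate columns of each sample are adjacent and that every sample has the same number of replicates; the function name and the values are hypothetical.

```python
from statistics import mean

def average_replicates(values, replicates_per_sample):
    """Average each run of adjacent replicate columns into one column."""
    if len(values) % replicates_per_sample != 0:
        raise ValueError("column count is not a multiple of the replicate count")
    return [
        mean(values[i:i + replicates_per_sample])
        for i in range(0, len(values), replicates_per_sample)
    ]

# One gene measured in 10 samples x 3 replicates = 30 columns.
row = [float(v) for v in range(30)]
averaged = average_replicates(row, 3)
print(len(averaged), averaged[:2])
```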

How do I create a phenotype label file? What types of experiments can I analyze?

GSEA can be used to analyze experiments of any type (including time-series, three or more classes, and so on). The phenotype labels (cls) ASCII file defines the experimental phenotypes and associates each sample in your dataset with one of those phenotypes. The cls file is an ASCII tab-delimited file, which you can easily create using a text editor. For more information, see Preparing Data Files in the GSEA User Guide.
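As a rough sketch for a categorical phenotype, the code below builds the text of a cls file. It assumes a three-line layout: a line with the number of samples, the number of classes, and the constant 1; a line starting with # that names the classes; and a line with one label per sample, in the same order as the samples in the expression dataset. That layout, and the use of spaces as the separator, are assumptions to verify against Data Formats. The function name and labels are hypothetical.

```python
def to_cls_text(sample_labels):
    """Build categorical cls text from one class label per sample."""
    # Class names in order of first appearance among the samples.
    classes = list(dict.fromkeys(sample_labels))
    lines = [
        f"{len(sample_labels)} {len(classes)} 1",
        "# " + " ".join(classes),
        " ".join(sample_labels),
    ]
    return "\n".join(lines) + "\n"

cls = to_cls_text(["TUMOR"] * 3 + ["CONTROL"] * 3)
print(cls, end="")
```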

What gene sets are available? Can I create my own gene sets?

You can use the gene sets in the Molecular Signature Database (MSigDB) or create your own. For more information about the MSigDB gene sets, see the MSigDB page of this web site. For more information about creating gene sets or using gene sets with GSEA, see Preparing Data Files in the GSEA User Guide.

How many genes should there be in a gene set?

GSEA automatically adjusts the enrichment statistics to account for different gene set sizes, as described in the Supplemental Information for the GSEA 2005 PNAS paper.

Can GSEA analyze a gene set that contains duplicate genes? What about duplicate gene sets?

Duplicate genes in a gene set and duplicate gene sets both affect GSEA results. GSEA automatically removes duplicate genes from each gene set, but does not check for duplicate gene sets. For more information, see Gene Sets in the GSEA User Guide.

Can GSEA analyze a gene set that contains genes that are not in my expression dataset?

The gene set enrichment analysis automatically restricts the gene sets to the genes in the expression dataset. The analysis report lists the gene sets and the number of genes that were included and excluded from the analysis.
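The restriction amounts to a set intersection, sketched below. This is an illustration of the idea rather than the GSEA code, and the structure of the returned report is hypothetical.

```python
def restrict_gene_sets(gene_sets, dataset_genes):
    """Keep only the genes of each set that occur in the dataset.

    Returns, per gene set, the included genes and the number excluded.
    """
    dataset = set(dataset_genes)
    report = {}
    for name, genes in gene_sets.items():
        unique = set(genes)  # duplicates within a set count once
        kept = unique & dataset
        report[name] = {
            "included": sorted(kept),
            "n_excluded": len(unique - kept),
        }
    return report

gene_sets = {"SET_A": ["TP53", "MYC", "MYC", "ABC1"]}
report = restrict_gene_sets(gene_sets, ["TP53", "MYC", "EGFR"])
print(report)
```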

What array platforms and organism species does GSEA support?

GSEA works on any data, as long as the gene identifiers in your expression data (the GCT file) match those in the gene sets (the GMT file), or if you have a CHIP file that provides the mapping between gene identifiers in your expression data and gene identifiers in the gene sets. All gene sets in MSigDB use human gene symbols regardless of the original organism. Consequently, our CHIP files provide the mappings from different platforms (e.g., mouse Affymetrix probe set IDs, human Affymetrix probe set IDs, etc.) to human gene symbols.

To see what platforms (CHIP files) are available, start the GSEA desktop application and click [...] next to "Chip platform(s)" on the "Run GSEA" page.

If your platform is not in this list, you have the following options:

  1. Create your own CHIP file to map your platform-specific gene identifiers to human gene symbols, and then use your CHIP file to collapse the dataset in GSEA. The CHIP file format is described here: CHIP file format.
  2. Convert your platform identifiers to human gene symbols outside GSEA, then run GSEA with 'Collapse dataset' = FALSE.

Can GSEA analyze miRNA expression data?

The only way for GSEA to analyze expression data with miRNA identifiers is to provide gene sets made of matching miRNA identifiers. This is not possible with MSigDB gene sets, which are made exclusively of protein coding genes in the form of human gene symbols.

GSEA Results

Where are the GSEA statistics (ES, NES, FDR, FWER, nominal p value) described?

For brief descriptions of the statistics that appear in the GSEA analysis report, see Interpreting GSEA in the GSEA User Guide. The GSEA 2005 PNAS paper also describes each of these statistics:

  • for FDR and nominal p value, see the section titled Appendix: Mathematical Description of Methods;
  • for FWER, see the section titled FWER in the Supplemental Information.

Why does GSEA use a false discovery rate (FDR) of 0.25 rather than the more classic 0.05?

An FDR of 25% indicates that the result is likely to be valid 3 out of 4 times, which is reasonable in the setting of exploratory discovery, where one is interested in finding candidate hypotheses to be further validated by future research. Given the lack of coherence in most expression datasets and the relatively small number of gene sets being analyzed, using a more stringent FDR cutoff may lead you to overlook potentially significant results. For more information about gene set enrichment analysis results, see Interpreting GSEA in the GSEA User Guide.

Why does GSEA give me significant results with gene set (tag) permutation, but not with phenotype permutation?

Phenotype permutation generally provides a more stringent assessment of significance and produces fewer false positives. Which permutation type you should use depends on the number of samples that you are analyzing. For more information, see the description of the Permutation type parameter on the Run GSEA Page in the GSEA User Guide.

How can I display details for more than the top 20 gene sets?

By default, the GSEA analysis report generates a Details link, which provides summary plots and detailed analysis results, for the top 20 gene sets in each phenotype. To generate the Details link for additional gene sets, modify the Plot graphs for the top sets of each phenotype parameter on the Run GSEA Page.

What should I do if I have no significant gene sets or too many significant gene sets?

The number of enriched gene sets depends on the structure of the data and the problem space. In general, one would expect to see at least a few gene sets enriched for a typical morphological or tissue-specific phenotype. If no enriched gene sets or a very large number of enriched gene sets pass the FDR threshold, first check that your gene sets and expression dataset use the same array format (see Consistent Feature Identifiers Across Data Files)  and that you have used the appropriate permutation type and number of permutations (see the Run GSEA Page). If you find no issues, consider the following:

  • No enriched gene sets of significance may indicate that, in fact, no gene sets are enriched. It may also be that you are analyzing too few samples, the biological signal in question is subtle, or the gene sets that you are analyzing do not represent the biology in question very well. You may still want to look at the top ranked gene sets, keeping in mind that these results provide weak evidence for potentially interesting hypotheses. You might also want to consider analyzing other gene sets or, if possible, additional samples.
  • Too many enriched gene sets of significance may indicate that, in fact, many gene sets are enriched between phenotypes. Perhaps the gene sets represent the same biological signal. You can check for this by looking for overlap in the leading-edge subsets within the gene sets (see Running a Leading Edge Analysis). Or, you might be seeing significant differences between the phenotypes due to technical artifacts, such as samples being run in different labs, by different operators, or against different arrays. As with too few enriched gene sets, you may still want to look at the top ranked gene sets, keeping in mind that these results provide potentially biased evidence for interesting hypotheses. You might also want to consider analyzing other gene sets or, if possible, additional samples.

For more information, see Interpreting GSEA in the GSEA User Guide.

What does it mean for a gene set to have a nominal p value of zero?

A reported p value of zero (0.0) indicates an actual p-value of less than 1/number-of-permutations. For a more accurate p value, increase the number of permutations performed by the analysis. For more information about gene set enrichment analysis results, see Interpreting GSEA in the GSEA User Guide.
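The following simplified sketch shows why a permutation-based p value can come out as exactly zero. It estimates the p value as the fraction of permutation scores at least as extreme as the observed score; restricting the comparison to permutation scores with the same sign as the observed score is an assumption of this sketch, and the scores are hypothetical.

```python
def nominal_p_value(observed_es, null_es):
    """Fraction of same-sign permutation scores at least as extreme
    as the observed enrichment score."""
    if observed_es >= 0:
        same_sign = [es for es in null_es if es >= 0]
        extreme = [es for es in same_sign if es >= observed_es]
    else:
        same_sign = [es for es in null_es if es < 0]
        extreme = [es for es in same_sign if es <= observed_es]
    if not same_sign:
        raise ValueError("no permutation scores share the sign of the observed score")
    return len(extreme) / len(same_sign)

# Ten hypothetical permutation scores: 0.1, 0.2, ..., 1.0
null = [0.1 * k for k in range(1, 11)]
p_rare = nominal_p_value(0.95, null)   # only the score 1.0 reaches 0.95
p_zero = nominal_p_value(1.5, null)    # no permutation score reaches 1.5
print(p_rare, p_zero)
```

With ten permutations the smallest nonzero estimate is 1/10, so a reported zero only says the true value lies below that resolution; more permutations give a finer resolution.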

What does it mean for a gene set to have a small nominal p value (p<0.025), but a high FDR value (FDR=1)?

The nominal p value estimates the significance of the observed enrichment score for a single gene set. However, when you are evaluating multiple gene sets, you must correct for multiple hypothesis testing. The FDR is the estimated probability that a gene set with a given enrichment score (normalized for gene set size) represents a false positive finding.

Generally, when your top gene sets have small nominal p values and high FDRs, it is because they are not as significant when compared with other gene sets in the empirical null distribution. This could be because you do not have enough samples, the biological signal is subtle, or the gene sets do not represent the biology in question very well. Also, the FDR is based on all gene sets; if only one of many gene sets is enriched, that gene set is likely to have a high FDR.

For more information, see Interpreting GSEA in the GSEA User Guide.

What is the difference between the weighted statistic and the classic statistic? Which should I use?

See the description of the Enrichment statistic parameter on the Run GSEA Page in the GSEA User Guide.

Why are my results different from yours when I analyze the example datasets using GSEA?

You are using a different random number generator (for sample permutation) and different seeds for that random number generator, so the resulting numbers are different. However, these differences should be very small, and the identity of the top (up or down) gene sets should be essentially the same. The FDRs might differ by at most a few percent from run to run. To get exactly the same result from run to run, specify the random number seed (it is a parameter in the GSEA software).
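The effect of fixing the seed can be seen in a generic sketch of label permutation. This uses Python's own random number generator purely for illustration and says nothing about the generator used inside the GSEA software; the function name and labels are hypothetical.

```python
import random

def permute_labels(labels, seed=None):
    """Return a shuffled copy of the phenotype labels.

    With a fixed seed the shuffle is identical on every run; with
    seed=None it generally differs from run to run.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

labels = ["A", "A", "A", "B", "B", "B"]
first = permute_labels(labels, seed=42)
second = permute_labels(labels, seed=42)
print(first == second)
```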

Why do some of the example datasets contain negative expression values?

In older releases of Affymetrix gene chips, expression values were trimmed averages over match and mismatch probes. When the mismatch probes had the higher signal, the result was a negative number.

Why does comparing phenotypes A vs B give different results from B vs A?

This is because these two comparisons produce different ranked lists of genes. You might expect similar results only if the ranked lists were perfectly symmetrical, and this usually does not happen.

MSigDB Gene Sets

What is MSigDB?

MSigDB, the Molecular Signature Database, contains curated gene sets for use with the gene set enrichment analysis. The GSEA team has begun the critical work of populating the MSigDB with curated gene sets. Increasing the number of gene sets increases the value of this resource; therefore, the GSEA team appreciates gene set contributions and encourages users to submit their gene sets to genesets@broadinstitute.org. For information about exporting gene sets from the MSigDB, see Gene Sets in the GSEA User Guide.

What is the difference between gene sets in MSigDB and GO/BioCarta/GenMAPP?

MSigDB contains gene sets formatted for use with GSEA. MSigDB places emphasis on a genomic, unbiased approach to the definition of gene sets; therefore, an important component of MSigDB is the collection of gene sets from published expression profiles. Unlike gene sets curated from prior knowledge (such as GO, BioCarta, and so on), experimental sets provide an unbiased readout of a biological state; experimental sets from microarray experiments reflect purely transcriptional events.

Does MSigDB include pathway diagrams?

No. However, some gene set pages in canonical pathways include external links to the original sources, which may include the diagrams.

Does MSigDB include GO gene sets?

The C5 collection in MSigDB v2.5 and later is made entirely of GO gene sets. In addition, the C2 (curated) category of gene sets in MSigDB v2.5 or earlier contains 17 GO gene sets. For a complete description of these gene sets, see the MSigDB page.

Why do some MSigDB gene sets have the same gene represented multiple times?

The gene sets reflect the information in the original source, and no attempt is made to modify the definition of a gene set (except for eliminating obvious gene duplications). The gene sets defined in terms of gene symbols eliminate the duplication produced by multiple probes representing the same gene.

How do I use your gene sets to analyze data from my favorite array platform?

GSEA provides a utility, Chip2Chip, which translates the gene identifiers from gene symbols to the probe identifiers of your array platform. For more information, see Chip2Chip in the GSEA User Guide.

How do I convert your gene sets from HUGO gene symbols to ENTREZ gene identifiers?

Starting with v3.0 MSigDB, there is no need to do this: we provide additional GMT files using Entrez Gene IDs as gene identifiers for all our gene sets.

How do I find out more information about a particular MSigDB gene set?

Each gene set in the MSigDB (Molecular Signature Database) is fully described by a gene set page. From this web site, use the MSigDB page to find a gene set. Click the gene set name to display its gene set page. From within the GSEA application, use the Browse MSigDB page to browse gene sets and display gene set pages. Alternatively, a Google search on the gene set name also displays a link to the gene set page.

Can I use the MSigDB gene sets without using GSEA?

Yes. You can view and download the gene sets from the MSigDB page.

Can I access the MSigDB gene sets using a web service?

You can use one of two URLs to display the XML representation of a gene set (replace BIOCARTA_AKT_PATHWAY with any gene set name):
http://www.broadinstitute.org/gsea/msigdb/cards/BIOCARTA_AKT_PATHWAY.xml
http://www.broadinstitute.org/gsea/msigdb/geneset_page.jsp?geneSetName=BIOCARTA_AKT_PATHWAY&format=xml
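If you need these URLs for many gene sets, the two patterns above can be generated programmatically. The sketch below only builds the URL strings and does not contact the server; the function name is hypothetical, and the patterns are taken directly from the two example URLs.

```python
from urllib.parse import quote, urlencode

BASE = "http://www.broadinstitute.org/gsea/msigdb"

def gene_set_xml_urls(gene_set_name):
    """Return the two XML URLs for a gene set name."""
    card = f"{BASE}/cards/{quote(gene_set_name, safe='')}.xml"
    page = f"{BASE}/geneset_page.jsp?" + urlencode(
        {"geneSetName": gene_set_name, "format": "xml"}
    )
    return card, page

card_url, page_url = gene_set_xml_urls("BIOCARTA_AKT_PATHWAY")
print(card_url)
print(page_url)
```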

The Downloads page provides an XML file that contains all MSigDB gene sets and the content of their gene set pages. The MSigDB page provides online access to the gene sets.

What does /// in a gene symbol mean?

This symbol indicates ambiguous mapping according to the Affymetrix conventions and serves as a field separator when a probe set id corresponds to several gene symbols. /// may appear in some gene sets curated from Affymetrix (NetAffx) annotation data. GSEA ignores such genes.
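If you process such gene sets outside GSEA and want to mirror this behavior, a minimal sketch is to drop every entry that contains the separator. The function name and the example entries are hypothetical.

```python
SEPARATOR = "///"

def drop_ambiguous_symbols(genes):
    """Remove entries whose symbol field lists several genes."""
    return [gene for gene in genes if SEPARATOR not in gene]

genes = ["TP53", "HLA-A /// HLA-B", "MYC"]
clean = drop_ambiguous_symbols(genes)
print(clean)
```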

How can I view/access gene sets from the v2.5 release of MSigDB?

The MSigDB v2.5 files are archived and are still available for download on the Downloads page. Scroll down to the bottom of the page to the 'Archived releases' section.

You can also view these archived sets using the MSigDB Browser in the GSEA desktop application (refer to the GSEA 2.0.8 Release Notes for further instructions).

Should I run GSEA on one or multiple MSigDB collections?

We recommend running GSEA on individual collections, or even sub-collections of gene sets, rather than on the entire MSigDB. Using individual collections saves time and produces more optimistic FDR q-values because GSEA has fewer gene sets to test. In addition, the results will be easier to interpret because they focus your attention on a particular kind of gene set. We have grouped gene sets into collections according to their derivation precisely for these reasons, and provide general suggestions for their usage here: MSigDB collections.

GSEA Software

How do I increase the amount of memory available to GSEA?

When launched from the GSEA web site, the GSEA desktop application starts with 1 GB of memory. To change the amount of memory available to GSEA, download the .jar file and then start the application using a direct call to the jar file. On the command line, use the -Xmx flag to set the amount of memory available to Java. For example, the following command allocates 1024 MB; raise the value to allocate more:
java -Xmx1024m -jar gsea2-2.08.jar

Please note the following:

  • Check that you are using the correct .jar file
  • Be sure to run the Java command from the folder that contains the .jar file or specify the full path to the .jar file
  • For more information about using GSEA from the command line, refer to Running GSEA from the Command Line in the User Guide

How do I run GSEA from the command line?

See Running GSEA from the Command Line in the GSEA User Guide.

What version of Java do I need for GSEA software?

GSEA 2.0.8 requires Java 1.6.

What version of R do I need for GSEA software?

Version 1.9 or later.

How do I add GSEA to my microarray analysis pipeline?

If you are using GenePattern pipelines (http://www.GenePattern.org/), GSEA is available as a GenePattern analysis module.

If you are implementing your own microarray analysis pipeline, GSEA can be run from the command line. Use full file specifications and the -Dhome parameter to ensure that you are reading data from and writing data to the desired locations. For more information, see Running GSEA from the Command Line in the GSEA User Guide.

Do I have to be connected to the internet to run GSEA software?

No. If you download the .jar file, you can use most functions in GSEA without being connected to the internet; for example, you can load files, run analyses, and review analysis results. However:

  • The Chip platform(s) and Gene sets database parameters (on pages such as Run GSEA) display data files available from the Broad ftp site; these data files are not available when you are working offline. Be sure to download the chip files and gene set files that you need before disconnecting from the internet.
  • The GSEA documentation and help files are on the GSEA web site; they are not available when you are working offline.

When working offline, clear the menu item Option>Connect over the internet.  By default, this item is selected and the Chip platform(s) and Gene sets database parameters display data files available from the Broad ftp site. Clearing the menu item disables this feature and avoids any time-consuming attempt to connect to the internet.

What is the difference between GSEA, GSEA-P, and GSEA-R?

GSEA refers to either the gene set enrichment analysis or the GSEA software. GSEA-P refers to the GSEA Java desktop software. GSEA-R refers to the R implementation of the software.

We strongly recommend using GSEA-P for standard analysis of microarray data.
The Java implementation of GSEA does not require any programming experience, includes many additional features not present in GSEA-R, and comes with a tutorial and extended documentation.

We do not recommend GSEA-R for standard analysis of microarray data.
This is because the R implementation of GSEA is closer to a working prototype than a finished software product. It is intended for users who want to tweak the GSEA algorithm rather than run routine GSEA analysis. We assume that such users not only have a very good command of R but are also familiar enough with the GSEA algorithm that the code itself is transparent to them; for that reason, the documentation is minimal. Consistent with this view, the R implementation offers minimal features, leaving it up to the user to add them.

How do I create the input files for GSEA in R?

The GSEA R code uses the gct, cls and gmt file formats for input. For more information, see Preparing Data Files in the GSEA User Guide.

Does GSEA have a programmatic API? What languages are supported?

The Broad Institute provides R and Java APIs on the Downloads page.

John Aach (Department of Genetics, Harvard Medical School) implemented a simplified version of the GSEA algorithm in MATLAB as part of the paper Global gene expression of Prochlorococcus ecotypes in response to changes in nitrogen availability. Aach notes that the MATLAB code supports only that portion of the GSEA algorithm required for the paper: it processes a particular file format, its significance calculations are performed by shuffling genes only and not phenotypes, and it does not include multiple hypothesis corrections for different gene sets.

For more information, please visit John Aach's GSEA webpage.