Using GSEA v3.0 Features

From GeneSetEnrichmentAnalysisWiki

Revision as of 02:19, 2 July 2017 by Eby (Talk | contribs)


This page is meant as a brief introduction and guide to using some of the new features in the GSEA v3.0 Beta series. The information will eventually go into our official documentation, but will reside here until then.


Re-running Analyses and Verifying GSEA v3.0 Results with GSEA v2.2.x

  • Our tests show the v3.0 Beta produces equivalent results, but use the Production version if you have concerns. At a minimum, verification with the Production version before publication is strongly recommended.
  • Cross-verifying with the Production version allows you to use the Beta features with confidence that the results are not affected by any changes in the Beta code. To do this, you can re-run your v3.0 analysis in v2.2.x using the equivalent parameter settings.
  • While there are a few new v3.0 parameters, these mostly provide additional control over the result files. Nearly all of the parameters required for analysis are present in both versions (the exception is the alternate delimiter setting; see below). After re-running the analysis, compare the output files and plots from the two runs. Our testing has shown these to be identical (though see the note on SVG output below).
  • If you provided a specific random seed value in your Beta analysis, simply carry this value over to the Production analysis settings under 'Advanced fields'. In GSEA v2.2.x and earlier, if you used the timestamp setting, there was no way to obtain the actual numeric value passed to the random number generator, so the exact analysis output could not be reliably reproduced. The Beta now saves this timestamp value in the 'Comments' section of the main page of the HTML Report. By entering this value as the random seed in your GSEA v2.2.x analysis, you can reproduce your v3.0 results.
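
The comparison step described above can be partly automated. The following is a minimal Python sketch, not part of GSEA: the function name `compare_reports` and the choice of file suffixes are illustrative, and it assumes the two report folders use the same relative file names. If the file names in your reports embed a run-specific timestamp, the pairing logic would need to be adapted.

```python
import filecmp
from pathlib import Path


def compare_reports(beta_dir, prod_dir, suffixes=(".xls", ".png")):
    """Compare same-named files in two report folders byte for byte.

    Hypothetical helper: pairs files by relative path, which is an
    assumption about how the two report folders are laid out.
    Returns three lists of relative paths: identical, different, missing.
    """
    beta_dir, prod_dir = Path(beta_dir), Path(prod_dir)
    identical, different, missing = [], [], []
    for beta_file in sorted(beta_dir.rglob("*")):
        if not beta_file.is_file() or beta_file.suffix not in suffixes:
            continue
        rel = beta_file.relative_to(beta_dir)
        prod_file = prod_dir / rel
        if not prod_file.is_file():
            missing.append(rel.as_posix())
        elif filecmp.cmp(beta_file, prod_file, shallow=False):
            # shallow=False forces a comparison of the file contents.
            identical.append(rel.as_posix())
        else:
            different.append(rel.as_posix())
    return identical, different, missing
```

Any entries in the `different` or `missing` lists are worth inspecting by hand before concluding that the two runs disagree.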

Generating SVG Plots

  • We have occasionally received requests for a feature to generate plots at a higher resolution than is possible with the PNG format. To meet this need, the Beta offers a new Create SVG plot images analysis setting in the 'Advanced fields' section.
  • This setting is turned off by default because these plots are somewhat CPU-intensive to produce and can be substantially larger than the corresponding PNGs, e.g. ~150x the size in the case of our Enrichment Plots. For that reason the generated files are gzip-compressed. They compress quite well but can still be up to ~5x the size of the corresponding PNGs. They can be decompressed using 'gunzip' on Mac or Linux and 7-Zip on Windows.
  • The SVG plots should be viewable in most modern web browsers and editable in a variety of software such as Inkscape or Adobe Illustrator; we have no particular recommendations.
  • Note that the SVG plots may closely match the PNGs without being exactly identical. The fonts in particular may be slightly different.
  • We recommend running your analyses with this setting disabled and using the PNGs to review your results. When you have a satisfactory analysis run, you can reproduce the results by re-running the analysis with the same random seed setting as described above. We recognize that this is not a convenient workflow, but changing the report generator to allow on-demand SVG generation was not feasible. We may revisit this in the future if circumstances permit, but in the meantime this feature at least provides a means of producing higher-resolution images.
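
As an alternative to 'gunzip' or 7-Zip, the compressed plots can be unpacked with a short script. This is a minimal Python sketch using only the standard library; the function name is illustrative, and it assumes the compressed plots carry a `.svg.gz` extension, which you should check against your own report folder.

```python
import gzip
import shutil
from pathlib import Path


def decompress_svg_plots(report_dir):
    """Write an uncompressed .svg next to every .svg.gz under report_dir.

    Assumption: compressed plots are named '<plot>.svg.gz'.
    Returns the list of .svg paths that were written.
    """
    written = []
    # sorted() materializes the listing before any new files are created.
    for gz_path in sorted(Path(report_dir).rglob("*.svg.gz")):
        svg_path = gz_path.with_suffix("")  # drops only the trailing '.gz'
        with gzip.open(gz_path, "rb") as src, open(svg_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(svg_path)
    return written
```

The original `.gz` files are left in place, so the script can be re-run safely.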

Override of GENE_SYMBOL.chip File Download

  • The GSEA Desktop makes use of a special CHIP file during the Collapse Dataset and Report generation steps. In Desktop v2.2.x, this file is downloaded anew from the Broad FTP site in each session and held in memory only until the program exits. Because these are frequently used features, this can cause failures on machines that are not connected to the Internet during an analysis run. Moreover, it means that the program repeatedly downloads this fixed-content file even though it has not changed in many years. When GSEA is run from the command line or via the GenePattern modules, this means a file transfer on every analysis.
  • This Beta introduces an override mechanism to reuse a downloaded file and also avoid errors when running without a connection. To use it, launch the Desktop with the 'GENE_SYMBOL_OVERRIDE' property set to 'true' and with the GENE_SYMBOL.chip from our FTP site saved into the 'gsea_home' sub-directory of the user's home directory (generally C:\users\<username>\gsea_home on Windows or /Users/<username>/gsea_home on a Mac). You can open this folder from within GSEA using Help > Show GSEA home folder.
  • This requires running the GSEA Desktop from the command line. To set this property, use 'java -DGENE_SYMBOL_OVERRIDE=true ...' in addition to the other Java flags and parameters. Alternatively, a path can be provided instead so that the file can be stored elsewhere, e.g. '-DGENE_SYMBOL_OVERRIDE=/my/path/to/GENE_SYMBOL.chip'. The file must retain the GENE_SYMBOL.chip name in either case.
  • This is an experimental feature for now. A future Beta may enable it by default, removing the need to launch from the command line. A future release of the GenePattern modules may include a similar dedicated override.
  • Note that the GENE_SYMBOL.chip has a special use within the GSEA code and should generally not be used directly for a Collapse Dataset call.
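
Before launching, it can help to confirm that the override file is where GSEA will look for it. The following is a minimal Python sketch under stated assumptions: the helper `gene_symbol_override_flag` and its `home` parameter are hypothetical and not part of GSEA. It only checks the file and builds the '-DGENE_SYMBOL_OVERRIDE=...' flag described above; the remaining Java flags and parameters are not shown and must still be supplied by you.

```python
from pathlib import Path

CHIP_NAME = "GENE_SYMBOL.chip"


def gene_symbol_override_flag(chip_path=None, home=None):
    """Return the -D flag for the override after checking the file exists.

    Hypothetical helper. With no chip_path, the file is expected in the
    'gsea_home' sub-directory of the home directory and the property is
    set to 'true'. With a chip_path, the property is set to that path.
    """
    if chip_path is None:
        base = Path(home) if home is not None else Path.home()
        default = base / "gsea_home" / CHIP_NAME
        if not default.is_file():
            raise FileNotFoundError(f"expected override file at {default}")
        return "-DGENE_SYMBOL_OVERRIDE=true"
    chip_path = Path(chip_path)
    # The file must keep the GENE_SYMBOL.chip name in either case.
    if chip_path.name != CHIP_NAME:
        raise ValueError(f"file must be named {CHIP_NAME}, got {chip_path.name}")
    if not chip_path.is_file():
        raise FileNotFoundError(f"no such file: {chip_path}")
    return f"-DGENE_SYMBOL_OVERRIDE={chip_path}"
```

The returned string can be added to the 'java' command line alongside your usual flags.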

Extracting the Heatmap Datasets

  • We have occasionally received requests for more control over the generated heatmap images, such as reordering or clustering rows and columns, changing the color scheme, adding a scale, etc. While these are all very worthwhile, at this time it is not feasible for us to alter GSEA's report generation and plotting code in the significant ways that would be required. We may revisit this in the future if circumstances permit; in the meantime, the Beta has a new feature that gives users an alternate path to accomplish this on their own. Some users have also asked for access to the underlying dataset represented by these heatmaps. To satisfy both of these needs, the Beta now offers a new Create GCT Files setting under 'Advanced fields', which saves the dataset backing each heatmap in the report so that it can be used in external visualizers or analysis tools.
  • A GCT file will be created for each heatmap plot; the file names will match except for the '.gct' extension. This is GSEA's standard data matrix format and it can be readily used in R, GenePattern, Morpheus, or GENE-E among other options. See our Data formats page for details.
  • You may wish to use the corresponding CLS file to identify the phenotype classes for the external software. This file is available as part of the saved report in the 'edb' subdirectory.
  • A common user question is "What is the source of this data?" Fortunately, the answer is quite simple: the data comes directly from the input dataset (GCT, RES, PCL, etc.). It is your original expression data, just reordered and restricted to the limited set of genes represented in the given heatmap (though possibly "collapsed" to map probe-level data to the corresponding genes). You may thus find these GCTs useful for further downstream analysis of a subset of your data in the context of an individual Gene Set, for example.
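
For users who want to load an extracted heatmap dataset into their own scripts, here is a minimal Python sketch of a GCT reader. It is not GSEA code: the function name `read_gct` is illustrative, and it assumes the GCT v1.2 layout (a '#1.2' version line, a line with the row and column counts, a header line beginning with the name and description columns, then one tab-delimited row per gene) with no missing values.

```python
import csv


def read_gct(path):
    """Read a GCT v1.2 file into (sample_names, rows).

    Hypothetical helper. Each entry of rows is a tuple of
    (name, description, list_of_float_values).
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        version = next(reader)[0].strip()
        if version != "#1.2":
            raise ValueError(f"unsupported GCT version line: {version!r}")
        n_rows, n_cols = (int(value) for value in next(reader)[:2])
        header = next(reader)
        samples = header[2:]  # first two columns are name and description
        rows = []
        for record in reader:
            if not record:
                continue  # tolerate trailing blank lines
            values = [float(value) for value in record[2:]]
            rows.append((record[0], record[1], values))
    # Check the data against the dimensions declared on line 2.
    if len(rows) != n_rows or len(samples) != n_cols:
        raise ValueError("dimensions on line 2 do not match the data")
    return samples, rows
```

The sample order returned here is the column order of the heatmap, so it can be matched against the class labels in the CLS file.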

Working with C3 MIR datasets using the alternate delimiter setting

  • In general, we recommend that Gene Set and input file names stick to alphanumeric characters, plus the underscore as a separator. Use of other 'special' characters can cause issues on some operating systems and in some programming languages, and those issues may vary across Mac, Linux, and Windows.
  • The C3 MIR sub-collection did not keep to this advice. It used the comma character as part of the Gene Set name, which conflicts with the separator character GSEA v2.2.3 uses for the Gene Set selector fields. Use of this field to select MIR Gene Sets would thus fail as GSEA could not distinguish between the commas within the names and the commas separating the names. Unfortunately, renaming these Gene Sets is difficult as this sub-collection has been in use for a long time.
  • GSEA v3.0 Beta introduces a new alternate delimiter setting in 'Advanced fields' to allow you to override the default separator for cases like this. We recommend using the semicolon ';' instead.
  • Regardless of the situation with the C3 MIR sub-collection names, we encourage you to avoid these characters in your own Gene Set and file names.
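
The ambiguity described above is easy to reproduce outside GSEA. The following Python sketch is only an illustration of the problem, not GSEA's actual parsing code; the function `split_gene_set_names` is hypothetical, and the two names are illustrative examples written in the comma-containing style of the C3 MIR sub-collection.

```python
def split_gene_set_names(field, delimiter=","):
    """Split a selector field into Gene Set names (illustrative only)."""
    return [name.strip() for name in field.split(delimiter) if name.strip()]


# Illustrative names in the style of the C3 MIR sub-collection.
names = ["TGCACTG,MIR-148A,MIR-152,MIR-148B", "AAAGGGA,MIR-204,MIR-211"]

# Joined with the default comma, the two names cannot be recovered:
comma_field = ",".join(names)
broken = split_gene_set_names(comma_field)  # 7 fragments, not 2 names

# Joined with a semicolon, the original names survive the round trip:
semicolon_field = ";".join(names)
recovered = split_gene_set_names(semicolon_field, delimiter=";")  # 2 names
```

With the comma, a parser has no way to tell which commas belong to a name and which separate names; with the semicolon, the split is unambiguous as long as no name contains a semicolon.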