Using GSEA v3.0 Features

From GeneSetEnrichmentAnalysisWiki
Revision as of 23:18, 1 July 2017 by Eby (talk | contribs)

This is a brief introduction and guide to using some of the new features in the GSEA v3.0 series. The information will eventually go into our official documentation, but will live here until then.

Re-running Analyses to Reproduce GSEA v3.0 Results

  • GSEA v3.0 now saves the timestamp value used as the random seed for an analysis into the 'Comments' section of the main page of the HTML Report. By setting the random seed to this value in a new analysis (under 'Advanced fields'), you can reproduce your results, so long as you keep the other computational parameters the same. Once you are satisfied with the results, you can vary certain reporting parameters, for example to create SVG plots or export heatmap GCTs (see below).
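Why reusing the seed reproduces a permutation-based result can be illustrated with a minimal sketch. This is not GSEA's own code: the function `permuted_labels`, the phenotype labels, and the seed value are all hypothetical, and Python's `random` module stands in for whatever generator GSEA uses internally.

```python
import random

def permuted_labels(labels, seed, n_permutations=3):
    """Return n_permutations shuffled copies of labels, all driven by one seed."""
    rng = random.Random(seed)  # dedicated generator, independent of global state
    result = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        result.append(shuffled)
    return result

labels = ["A", "A", "A", "B", "B", "B"]  # hypothetical phenotype labels
seed = 1498951080000                     # hypothetical seed copied from a report

first_run = permuted_labels(labels, seed)
second_run = permuted_labels(labels, seed)
print(first_run == second_run)  # True: the same seed yields the same permutations
```

Every permutation drawn from the same seed is identical across runs, which is why the other computational parameters must also stay the same for the results to match.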

Generating SVG Plots

  • We occasionally get requests for a feature to generate plots at a higher resolution than the PNG format allows. To meet this need, GSEA v3.0 offers a new 'Create SVG plot images' analysis setting in the 'Advanced fields' section.
  • This setting is turned off by default because it is somewhat CPU-intensive and because it creates substantially larger files, e.g. ~150x the size of our Enrichment Plot PNGs. For that reason, the SVGs are gzip-compressed. They compress quite well but can still be up to ~5x larger than the PNGs. They can be decompressed using 'gunzip' on Mac or Linux and 7-Zip on Windows.
  • The SVG plots should be viewable in most modern web browsers and editable in a variety of software such as Inkscape or Adobe Illustrator; we have no particular recommendations.
  • Note that the SVG plots may closely, but not exactly, match the PNGs. The fonts in particular may be slightly different.
  • We recommend running your analyses with this setting disabled and using the PNGs to review your results. When you have a satisfactory analysis run, you can reproduce the results by re-running the analysis with the same random seed setting as described above. We recognize that this is not a convenient workflow, but changing the report generator to allow on-demand SVG generation was not feasible. We may revisit this in the future if circumstances permit, but in the meantime this feature at least provides a means of producing higher-resolution images.
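As an alternative to 'gunzip' or 7-Zip, the compressed plots can be unpacked in bulk with a short script. The sketch below uses only the Python standard library; it assumes the compressed files end in `.svg.gz`, which may not match the actual file names in your report, and the helper name `decompress_svgs` is hypothetical.

```python
import gzip
import shutil
from pathlib import Path

def decompress_svgs(report_dir):
    """Decompress every *.svg.gz under report_dir, leaving the originals in place."""
    written = []
    # sorted() collects the matches up front, before any new files are created
    for gz_path in sorted(Path(report_dir).rglob("*.svg.gz")):
        svg_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(svg_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(svg_path)
    return written
```

Calling `decompress_svgs("path/to/report")` returns the list of `.svg` files it wrote, each placed next to its compressed source.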

Better handling of special, internally used CHIP files

  • The GSEA Desktop uses two special files - GENE_SYMBOL.chip and SEQ_ACCESSION.chip - behind the scenes. In GSEA v2.2.x, these files were re-fetched from the Broad FTP site in each session and only kept until program exit. For GSEA v3.0, we now cache these files locally on the user's computer so that the program only needs to download them once.
  • These files can be found within the gsea_home sub-directory of the user's home directory (generally C:\users\<username>\gsea_home on Windows or /Users/<username>/gsea_home on a Mac). You can open this folder from within GSEA using Help > Show GSEA home folder. The CHIP file caching location is gsea_home/file_cache/chip. If GSEA reports any errors trying to read these files, clearing out this location might help resolve the problem.
  • Note that both GENE_SYMBOL and SEQ_ACCESSION serve special internal purposes within the GSEA code and should not be used directly as inputs to your analyses.
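Clearing the cache location described above can be scripted. The following is a minimal sketch, assuming the cache sits at gsea_home/file_cache/chip under the home directory as stated; the function name `clear_chip_cache` and its `home_dir` parameter are hypothetical, and running it while GSEA is closed is an assumption made for safety rather than a documented requirement.

```python
from pathlib import Path

def clear_chip_cache(home_dir=None):
    """Delete cached files in <home>/gsea_home/file_cache/chip; return their names."""
    home = Path(home_dir) if home_dir is not None else Path.home()
    cache_dir = home / "gsea_home" / "file_cache" / "chip"
    removed = []
    if not cache_dir.is_dir():
        return removed  # nothing has been cached yet
    for entry in sorted(cache_dir.iterdir()):
        if entry.is_file():  # leave any sub-directories untouched
            entry.unlink()
            removed.append(entry.name)
    return removed
```

The directory itself is left in place. Given the caching behavior described above, GSEA would be expected to download the files again the next time it needs them.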

Extracting the Heatmap Datasets

  • We occasionally get requests for more control over heatmap images: reordering/clustering rows & columns, changing color scheme, adding a scale, etc. Some users have also asked for access to the underlying dataset represented by these heatmaps.
  • Restructuring the existing code to do this isn't possible at this time. Instead, GSEA v3.0 has a new feature that lets users do these things on their own: we've added a new 'Create GCT Files' setting under 'Advanced fields' which will save the datasets backing all the heatmaps in the report for use in external visualizers or analysis tools.
  • For each heatmap plot, it creates a GCT file with the same file name (except for the '.gct' extension). This is GSEA's standard data matrix format and it can be readily used in R, GenePattern, Morpheus, or GENE-E among other options. See our Data formats page for details.
  • You may wish to use the corresponding CLS file to identify phenotype classes to external software. This file is saved with the report in the 'edb' subdirectory.
  • To answer the natural question, "What is the source of this data?", it comes directly from the input dataset (GCT, RES, PCL, etc). It is your original expression data, just reordered to match the limited set of genes represented in the given heatmap (and possibly "collapsed" to map probe-level data to genes). You may find these GCTs useful for further downstream analysis of a subset of your data in the context of an individual Gene Set, for example.
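For downstream work outside the tools listed above, a GCT file is simple enough to read directly. The sketch below assumes the exported files follow the GCT v1.2 layout (a `#1.2` line, a line with row and column counts, a header line beginning with the name and description columns, then one row per gene) and that every value is numeric; the function name `read_gct` is hypothetical.

```python
import csv

def read_gct(path):
    """Parse a GCT v1.2 file into (sample_names, {row_name: [values]})."""
    with open(path, newline="") as handle:
        rows = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        version = next(rows)[0].strip()
        if version != "#1.2":
            raise ValueError(f"unexpected GCT version line: {version!r}")
        n_rows, n_cols = (int(x) for x in next(rows)[:2])
        header = next(rows)
        samples = header[2:2 + n_cols]  # skip the name and description columns
        data = {}
        for row in rows:
            if not row:
                continue  # tolerate blank lines at the end of the file
            data[row[0]] = [float(v) for v in row[2:2 + n_cols]]
    if len(data) != n_rows:
        raise ValueError(f"expected {n_rows} rows, found {len(data)}")
    return samples, data
```

The returned sample names are in the column order of the file, so they can be lined up against the class labels in the corresponding CLS file.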

Working with older versions of C3 MIR datasets using the alternate delimiter setting

  • In general we recommend that Gene Set and input file names stick to alphanumeric characters, plus the underscore as a separator. Use of other 'special' characters can cause issues on some operating systems and in some programming languages, and those issues may vary across Mac, Linux, and Windows.
  • The C3 MIR sub-collection in former versions of MSigDB (v5.2 and earlier) did not keep to this advice. It used the comma character as part of the Gene Set name, which conflicted with the separator character GSEA v2.x used for the Gene Set selector fields. This caused failures because GSEA could not distinguish between the commas within the names and the commas separating the names. These Gene Sets have been renamed in MSigDB v6.0 to avoid these issues.
  • GSEA v3.0 introduces a new alternate delimiter setting in 'Advanced fields' to allow you to override the default separator for cases like this. We recommend using the semicolon ';' instead.
  • We encourage you to avoid these characters in your own Gene Set and file names. It's safest to stick to alphanumeric characters, possibly with an underscore in place of spaces or other special characters.
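The delimiter conflict described above, and the naming advice, can be illustrated with a short sketch. It is not GSEA's parsing code: the helper names are hypothetical, the gene set names are illustrative examples in the style of the older C3 MIR names, and `sanitize_name` is one possible cleanup rule rather than the renaming scheme MSigDB used.

```python
import re

def split_gene_sets(selection, delimiter=","):
    """Split a selector string into gene set names on the given delimiter."""
    return [name.strip() for name in selection.split(delimiter) if name.strip()]

def is_safe_name(name):
    """True if name uses only alphanumeric characters and underscores."""
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None

def sanitize_name(name):
    """Replace each run of other characters with a single underscore."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")

names = ["AAAGGGA,MIR-204,MIR-211", "HALLMARK_APOPTOSIS"]  # illustrative names

# With the comma as separator, the first name is broken into three pieces.
print(len(split_gene_sets(",".join(names))))       # 4
# With a semicolon as the alternate delimiter, both names survive intact.
print(split_gene_sets(";".join(names), ";") == names)  # True

print(is_safe_name(names[0]))   # False
print(sanitize_name(names[0]))  # AAAGGGA_MIR_204_MIR_211
```

Checking names with something like `is_safe_name` before building a Gene Set file avoids needing the alternate delimiter at all.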