Difference between revisions of "Using GSEA v3.0 Features"

From GeneSetEnrichmentAnalysisWiki
 
Line 6: Line 6:
  
 
<p>
 
<p>
This page is meant as a brief introduction and guide to using some of the new features in the GSEA v3.0 Beta series.  The information will eventually go into our official documentation, but will reside here until then.
+
This is a brief introduction and guide to using some of the new features in the GSEA v3.0 series.  The information will eventually go into our official documentation, but will live here until then.
 
</p>
 
</p>
  
<h1>Re-running Analyses and Verifying GSEA v3.0 Results with GSEA v2.2.x</h1>
+
<h1>Re-running Analyses to Reproduce GSEA v3.0 Results</h1>
 
<ul>
 
<ul>
 
<li>
 
<li>
Our tests show the v3.0 Beta produces equivalent results, but use the Production version if you have concerns.  At a minimum, verification with the Production version before publication is strongly recommended.
+
A new feature has been implemented in GSEA v3.0 to save this timestamp value into the 'Comments' section of the main page of the HTML Report.  By setting this value for the random seed in a new analysis (under 'Advanced fields') you can reproduce your results, so long as you keep the other computational parameters the same. You can vary certain reporting parameters - for example to create SVG plots or export heatmap GCTs (see below) - after you are satisfied with the results.
</li>
 
<li>
 
Cross-verifying with the Production version allows you to use the Beta features with confidence that the results are not affected by any changes in the Beta code.  To do this, you can re-run your v3.0 analysis in v2.2.x using the equivalent parameter settings. 
 
</li>
 
<li>
 
While there are a few new v3.0 parameters, these mostly provide additional control over the result files.  Nearly all of the parameters required for analysis are present in both versions (the exception being <i>alternate delimiter</i>, see below).  After re-running the analysis, compare the output files and plots from the runs between the two versions.  Our testing has shown these to be identical (though see the note on <i>SVG</i> output below).
 
</li>
 
<li>
 
If you have provided a specific random seed value in your Beta analysis simply carry this value over to the Production analysis settings under 'Advanced fields'.  In GSEA v2.2.x and earlier, if you used the <i>timestamp</i> setting there was no way to obtain the actual numeric value used in the random number generator, meaning that it was not possible to reliably reproduce the exact analysis output.  A new feature has been implemented in the Beta to save this timestamp value into the 'Comments' section of the main page of the HTML Report.  By setting this value for the random seed in your GSEA v2.2.x analysis you can reproduce your v3.0 results.
 
 
</li>
 
</li>
 
</ul>
 
</ul>
Line 28: Line 19:
 
<ul>
 
<ul>
 
<li>
 
<li>
We have occasionally received requests for a feature to generate plots in a higher-resolution than is possible with the PNG format.  To meet this need, the Beta offers a new <i>Create SVG plot images</i> analysis setting in the 'Advanced fields' section.
+
We occasionally get requests for a feature to generate plots in a higher-resolution than the PNG format allows.  To meet this need, GSEA v3.0 offers a new <i>Create SVG plot images</i> analysis setting in the 'Advanced fields' section.
 
</li>
 
</li>
 
<li>
 
<li>
This setting is turned off by default as these plots are somewhat CPU-intensive to produce and they can be substantially larger than the corresponding PNGs, e.g. <b>~150x</b> the size in the case of our Enrichment Plots.  The generated files are GZ compressed for the same reason.  They compress quite well but can still be up to ~5x the size of the corresponding PNGs.  They can be decompressed using 'gunzip' on Mac or Linux and 7-Zip on Windows.
+
This setting is turned off by default as it is somewhat CPU-intensive and because it creates substantially larger plots, e.g. <b>~150x</b> the size for our Enrichment Plot PNGs.  The SVGs are GZ compressed for the same reason.  They compress quite well but can still be up to ~5x larger than the PNGs.  They can be decompressed using 'gunzip' on Mac or Linux and 7-Zip on Windows.
 
</li>
 
</li>
 
<li>
 
<li>
Line 44: Line 35:
 
</ul>
 
</ul>
  
<h1>Override of GENE_SYMBOL.chip File Download</h1>
+
<h1>Better handling of special, internally used CHIP files</h1>
 
<ul>
 
<ul>
 
<li>
 
<li>
The GSEA Desktop makes use of a special CHIP file during the Collapse Dataset and Report generation steps.  In Desktop v2.2.x, this file is downloaded anew from the Broad FTP site in each session and held in memory only until the program exits.  This leads to possible failures for machines that are not connected to the Internet during an analysis run, which is common as these are often-used features.  Moreover, it means that the program repeatedly downloads this fixed-content file even though it has not changed in many years.  For use at the command-line or via the GenePattern modules, this means a file transfer on every analysis.
+
The GSEA Desktop uses two special files - GENE_SYMBOL.chip and SEQ_ACCESSION.chip - behind the scenes.  In GSEA v2.2.x, these files were re-fetched from the Broad FTP site in each session and only kept until program exit.  For GSEA v3.0, we now cache these files locally on the user's computer so that the program only needs to download them once.
 
</li>
 
</li>
 
<li>
 
<li>
This Beta introduces an override mechanism to reuse a downloaded file and also avoid errors when running without a connection.  This is used by launching the Desktop with the 'GENE_SYMBOL_OVERRIDE' property set to 'true' and with the [ftp://ftp.broadinstitute.org/pub/gsea/annotations/GENE_SYMBOL.chip GENE_SYMBOL.chip] from our FTP site saved into the 'gsea_home' sub-directory of the users home directory (generally C:\users\<username>\gsea_home on Windows or /Users/<username>/gsea_home on a Mac).  You can open this from within GSEA using <i>Help > Show GSEA home folder</i>.
+
These files can be found within the ''gsea_home'' sub-directory of the user's home directory (generally C:\users\<username>\gsea_home on Windows or /Users/<username>/gsea_home on a Mac).  You can open this from within GSEA using <i>Help > Show GSEA home folder</i>.  The CHIP file caching location is ''gsea_home''/file_cache/chip.  If GSEA reports any errors trying to read these files, clearing out this location might help resolve the problem.
 
</li>
 
</li>
 
<li>
 
<li>
This requires [http://www.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html?Running_Command_Line running the GSEA Desktop from the command line].  To set this property, use 'java -DGENE_SYMBOL_OVERRIDE=true ...' in addition to the other Java flags and parameters.  Alternatively, a path can be provided instead so that the file can be stored elsewhere, e.g. '-DGENE_SYMBOL_OVERRIDE=/my/path/to/GENE_SYMBOL.chip'.  The file <b>must</b> retain the GENE_SYMBOL.chip name in either case.
+
Note that both GENE_SYMBOL and SEQ_ACCESSION serve special internal purposes within the GSEA code and should not be used directly as inputs to your analyses.
</li>
 
<li>
 
<b>This is an experimental feature for now</b>.  A future Beta may enable it by default, removing the need to launch from the command line.  A future release of the GenePattern modules may include a similar dedicated override.
 
</li>
 
<li>
 
Note that the GENE_SYMBOL.chip has a special use within the GSEA code and should generally not be used directly for a Collapse Dataset call.
 
 
</li>
 
</li>
 
</ul>
 
</ul>
Line 66: Line 51:
 
<ul>
 
<ul>
 
<li>
 
<li>
We have occasionally received requests for more control over the generated heatmap images, such as reordering/clustering rows & columns, changing the color scheme, adding a scale, etc.  While these are all very worthwhile, at this time it is not feasible for us to alter GSEA's report generation and plotting code in the significant ways that would be required to accomplish this.  While we may revisit this in the future if circumstances permit, the Beta has a new feature that gives an alternate path for users to accomplish this on their own.  Some users have also asked for access to the underlying dataset represented by these heatmaps.  To satisfy both of these needs, the Beta now offers a new <i>Create GCT Files</i> setting under 'Advanced fields' which will save the datasets backing each of the heatmaps in the report, allowing their use in external visualizers or analysis tools.
+
We occasionally get requests for more control over heatmap images: reordering/clustering rows & columns, changing color scheme, adding a scale, etc.  Some users have also asked for access to the underlying dataset represented by these heatmaps.</li>
 +
<li>
 +
Restructuring the existing code to do this isn't possible at this time.  Instead, however, GSEA v3.0 has a new feature for users to do these things on their own: we've added a new <i>Create GCT Files</i> setting under 'Advanced fields' which will save the datasets backing all the heatmaps in the report for use in external visualizers or analysis tools.
 
</li>
 
</li>
 
<li>
 
<li>
A GCT file will be created for each heatmap plot; the file names will match except for the '.gct' extension.  This is GSEA's standard data matrix format and it can be readily used in R, [http://www.genepattern.org GenePattern], [https://software.broadinstitute.org/morpheus/ Morpheus], or [https://software.broadinstitute.org/GENE-E/ GENE-E] among other options.  See our [[Data formats]] page for details.
+
For each heatmap plot, it creates a GCT file with the same file name (except for the '.gct' extension).  This is GSEA's standard data matrix format and it can be readily used in R, [http://www.genepattern.org GenePattern], [https://software.broadinstitute.org/morpheus/ Morpheus], or [https://software.broadinstitute.org/GENE-E/ GENE-E] among other options.  See our [[Data formats]] page for details.
 
</li>
 
</li>
 
<li>
 
<li>
You may wish to use the corresponding CLS file to identify phenotype classes to this external software.  This file is available as part of the saved report in the 'edb' subdirectory.
+
You may wish to use the corresponding CLS file to identify phenotype classes to external software.  This file is saved with the report in the 'edb' subdirectory.
 
</li>
 
</li>
 
<li>
 
<li>
A common user question is <b>What is the source of this data?</b> Fortunately, the answer is quite simple: the data comes directly from the input dataset (GCT, RES, PCL, etc).  It is your original expression data, just reordered to match the limited set of genes represented in the given heatmap (though possibly "collapsed" to map probe-level data to the corresponding genes).  You may thus find these GCTs useful for further downstream analysis of a subset of your data in the context of an individual Gene Set, for example.
+
To answer the natural question, &quot;<b>What is the source of this data?</b>&quot;, it comes directly from the input dataset (GCT, RES, PCL, etc).  It is your original expression data, just reordered to match the limited set of genes represented in the given heatmap (and possibly "collapsed" to map probe-level data to genes).  You may find these GCTs useful for further downstream analysis of a subset of your data in the context of an individual Gene Set, for example.
 
</li>
 
</li>
 
</ul>
 
</ul>
  
<h1>Working with C3 MIR datasets using the <i>alternate delimiter</i> setting</h1>
+
<h1>Working with older versions of C3 MIR datasets using the <i>alternate delimiter</i> setting</h1>
 
<ul>
 
<ul>
 
<li>
 
<li>
Line 85: Line 72:
 
</li>
 
</li>
 
<li>
 
<li>
The C3 MIR sub-collection did not keep to this advice. It used the comma character as part of the Gene Set name, which conflicts with the separator character GSEA v2.2.3 uses for the Gene Set selector fields.  Use of this field to select MIR Gene Sets would thus fail as GSEA could not distinguish between the commas <i>within the names</i> and the commas <i>separating the names</i>.  Unfortunately, renaming these Gene Sets is difficult as this sub-collection has been in use for a long time.
+
The C3 MIR sub-collection in former versions of MSigDB (v5.2 and earlier) did not keep to this advice. It used the comma character as part of the Gene Set name, which conflicts with the separator character GSEA v2.x used for the Gene Set selector fields and caused failures as GSEA could not distinguish between the commas <i>within the names</i> and the commas <i>separating the names</i>.  These Gene Sets have been renamed in MSigDB v6.0 to avoid these issues.
 
</li>
 
</li>
 
<li>
 
<li>
GSEA v3.0 Beta introduces a new <i>alternate delimiter</i> setting in 'Advanced fields' to allow you to override the default separator for cases like this.  We recommend using the semicolon ';' instead.
+
GSEA v3.0 introduces a new <i>alternate delimiter</i> setting in 'Advanced fields' to allow you to override the default separator for cases like this.  We recommend using the semicolon ';' instead.
 
</li>
 
</li>
 
<li>
 
<li>
Regardless of the situation with the C3 MIR sub-collection names, we encourage you to avoid these characters in your own Gene Set and file names.
+
We encourage you to avoid these characters in your own Gene Set and file names.  It's safest to stick to alphanumeric characters, possibly with an underscore in place of spaces or other special characters.
 
</li>
 
</li>
 
</ul>
 
</ul>

Latest revision as of 23:18, 1 July 2017


This is a brief introduction and guide to using some of the new features in the GSEA v3.0 series. The information will eventually go into our official documentation, but will live here until then.

Re-running Analyses to Reproduce GSEA v3.0 Results

  • When the random seed is set to timestamp, GSEA v2.2.x and earlier gave no way to obtain the actual numeric value used by the random number generator, so an analysis could not be reproduced exactly. GSEA v3.0 saves this timestamp value in the 'Comments' section of the main page of the HTML Report. By setting this value as the random seed in a new analysis (under 'Advanced fields') you can reproduce your results, so long as you keep the other computational parameters the same. You can vary certain reporting parameters, for example to create SVG plots or export heatmap GCTs (see below), after you are satisfied with the results.
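
As a general illustration of why recording the seed matters, here is a minimal Python sketch. It is not GSEA's code and does not use GSEA's random number generator; the function name permute_labels and the millisecond clock value are assumptions made for the example. It only shows that a permutation procedure driven by an explicit, recorded seed yields the same permutations when re-run.

```python
import random
import time


def permute_labels(labels, n_perm, seed):
    """Return n_perm shuffled copies of labels, driven by an explicit seed."""
    # A private generator, so the seed alone determines the output.
    rng = random.Random(seed)
    permutations = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        permutations.append(shuffled)
    return permutations


labels = ["A", "A", "A", "B", "B", "B"]

# "timestamp"-style seeding: the seed comes from the clock, so it has to be
# written down somewhere to be reusable later.
recorded_seed = int(time.time() * 1000)
first_run = permute_labels(labels, n_perm=5, seed=recorded_seed)

# Re-running with the recorded value reproduces the same permutations.
second_run = permute_labels(labels, n_perm=5, seed=recorded_seed)
print(first_run == second_run)  # prints True
```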

Generating SVG Plots

  • We occasionally get requests for a feature to generate plots at a higher resolution than the PNG format allows. To meet this need, GSEA v3.0 offers a new Create SVG plot images analysis setting in the 'Advanced fields' section.
  • This setting is turned off by default because SVG generation is somewhat CPU-intensive and produces substantially larger files, e.g. ~150x the size of the PNGs for our Enrichment Plots. The SVGs are GZ compressed for the same reason. They compress quite well but can still be up to ~5x larger than the PNGs. They can be decompressed using 'gunzip' on Mac or Linux and 7-Zip on Windows.
  • The SVG plots should be viewable in most modern web browsers and editable in a variety of software such as Inkscape or Adobe Illustrator; we have no particular recommendations.
  • Note that the SVG plots may closely, but not exactly, match the PNGs. The fonts in particular may be slightly different.
  • We recommend running your analyses with this setting disabled and using the PNGs to review your results. When you have a satisfactory analysis run, you can reproduce the results by re-running the analysis with the same random seed setting as described above. We recognize that this is not a convenient workflow, but changing the report generator to allow on-demand SVG generation was not feasible. We may revisit this in the future if circumstances permit, but in the meantime this feature at least provides a means of producing higher-resolution images.
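
For readers who would rather script the decompression step than run 'gunzip' or 7-Zip by hand, the following Python sketch uses only the standard library. It assumes the compressed plots carry a '.svg.gz' file name ending, which is a guess about the naming rather than something stated above; the function name decompress_svg_plots is likewise just for illustration.

```python
import gzip
import shutil
from pathlib import Path


def decompress_svg_plots(report_dir):
    """Decompress every *.svg.gz found under report_dir.

    Each .svg is written next to its .gz source, which is left in place.
    Returns the list of paths that were written.
    """
    written = []
    for gz_path in sorted(Path(report_dir).rglob("*.svg.gz")):
        svg_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(svg_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(svg_path)
    return written
```

Calling decompress_svg_plots on the folder of a saved report should leave one '.svg' beside each compressed file; the originals are kept so the report folder stays intact.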

Better handling of special, internally used CHIP files

  • The GSEA Desktop uses two special files - GENE_SYMBOL.chip and SEQ_ACCESSION.chip - behind the scenes. In GSEA v2.2.x, these files were re-fetched from the Broad FTP site in each session and only kept until program exit. For GSEA v3.0, we now cache these files locally on the user's computer so that the program only needs to download them once.
  • These files can be found within the gsea_home sub-directory of the user's home directory (generally C:\users\<username>\gsea_home on Windows or /Users/<username>/gsea_home on a Mac). You can open this from within GSEA using Help > Show GSEA home folder. The CHIP file caching location is gsea_home/file_cache/chip. If GSEA reports any errors trying to read these files, clearing out this location might help resolve the problem.
  • Note that both GENE_SYMBOL and SEQ_ACCESSION serve special internal purposes within the GSEA code and should not be used directly as inputs to your analyses.
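
Clearing the cache location mentioned above can be done in a file browser, but a small Python sketch may be handy on shared or remote machines. It assumes the default gsea_home location under the user's home directory unless another path is passed in; the function name clear_chip_cache is hypothetical, and it is probably safest to close GSEA before running it.

```python
import shutil
from pathlib import Path


def clear_chip_cache(gsea_home=None):
    """Delete everything inside gsea_home/file_cache/chip.

    Returns the names of the removed entries (empty if there is no cache).
    """
    home = Path(gsea_home) if gsea_home else Path.home() / "gsea_home"
    cache_dir = home / "file_cache" / "chip"
    removed = []
    if not cache_dir.is_dir():
        return removed  # nothing has been cached yet
    for entry in sorted(cache_dir.iterdir()):
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return removed
```

The cache directory itself is kept; only its contents are removed, so the files can be downloaded again the next time they are needed.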

Extracting the Heatmap Datasets

  • We occasionally get requests for more control over heatmap images: reordering/clustering rows & columns, changing color scheme, adding a scale, etc. Some users have also asked for access to the underlying dataset represented by these heatmaps.
  • Restructuring the existing code to do this isn't possible at this time. Instead, GSEA v3.0 has a new feature that lets users do these things on their own: we've added a new Create GCT Files setting under 'Advanced fields' which will save the datasets backing all the heatmaps in the report for use in external visualizers or analysis tools.
  • For each heatmap plot, it creates a GCT file with the same file name (except for the '.gct' extension). This is GSEA's standard data matrix format and it can be readily used in R, GenePattern, Morpheus, or GENE-E among other options. See our Data formats page for details.
  • You may wish to use the corresponding CLS file to identify phenotype classes to external software. This file is saved with the report in the 'edb' subdirectory.
  • To answer the natural question, "What is the source of this data?", it comes directly from the input dataset (GCT, RES, PCL, etc). It is your original expression data, just reordered to match the limited set of genes represented in the given heatmap (and possibly "collapsed" to map probe-level data to genes). You may find these GCTs useful for further downstream analysis of a subset of your data in the context of an individual Gene Set, for example.
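
As one possible starting point for downstream work on an exported heatmap GCT, here is a minimal Python reader. It relies on the standard GCT v1.2 layout (a '#1.2' version line, a line with the row and column counts, a header line beginning with NAME and Description, then one tab-separated row per gene). The function name read_gct is an assumption, and the sketch does not handle missing values or duplicate row names.

```python
import csv


def read_gct(path):
    """Parse a GCT v1.2 file into (sample_names, {row_name: [values]})."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        version = next(reader)[0]
        if version.strip() != "#1.2":
            raise ValueError(f"unexpected GCT version line: {version!r}")
        # Second line: number of data rows, then number of sample columns.
        n_rows, n_cols = (int(x) for x in next(reader)[:2])
        header = next(reader)
        samples = header[2:]  # skip the NAME and Description columns
        rows = {}
        for record in reader:
            if not record:
                continue  # tolerate blank lines
            name, _description, *values = record
            rows[name] = [float(v) for v in values]
    if len(samples) != n_cols or len(rows) != n_rows:
        raise ValueError("dimensions do not match the GCT header")
    return samples, rows
```

The returned sample names can then be paired with the class labels from the CLS file in the 'edb' subdirectory when passing the data to other tools.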

Working with older versions of C3 MIR datasets using the alternate delimiter setting

  • In general we recommend that Gene Set and input file names should stick to alphanumeric characters, plus the underscore as a separator. Use of other 'special' characters can cause issues on some operating systems and programming languages, and those issues may vary across Mac, Linux, and Windows.
  • The C3 MIR sub-collection in former versions of MSigDB (v5.2 and earlier) did not keep to this advice. It used the comma character as part of the Gene Set name, which conflicts with the separator character GSEA v2.x used for the Gene Set selector fields and caused failures as GSEA could not distinguish between the commas within the names and the commas separating the names. These Gene Sets have been renamed in MSigDB v6.0 to avoid these issues.
  • GSEA v3.0 introduces a new alternate delimiter setting in 'Advanced fields' to allow you to override the default separator for cases like this. We recommend using the semicolon ';' instead.
  • We encourage you to avoid these characters in your own Gene Set and file names. It's safest to stick to alphanumeric characters, possibly with an underscore in place of spaces or other special characters.
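
The delimiter conflict described above can be seen in a few lines of Python. This is only an illustration of the splitting problem, not GSEA's parsing code; the function name split_gene_set_names is hypothetical, and the two names are illustrative examples in the style of the older comma-containing C3 MIR sets.

```python
def split_gene_set_names(field, delimiter=","):
    """Split a selector field into Gene Set names on the given delimiter."""
    return [name.strip() for name in field.split(delimiter) if name.strip()]


# Illustrative names with commas inside the name itself.
names = ["AAAGGGA,MIR-204,MIR-211", "GTGCCTT,MIR-506"]

# Joined with the default comma, splitting yields five fragments,
# so the two original names cannot be recovered.
print(split_gene_set_names(",".join(names)))

# Joined with a semicolon, splitting returns the original two names.
print(split_gene_set_names(";".join(names), delimiter=";"))
```

The same reasoning applies to any delimiter: it only works if it never appears inside a name, which is why sticking to alphanumeric characters and underscores avoids the problem altogether.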