Difference between revisions of "R-GSEA Readme"

The GSEA program is provided as a standalone R program. This file describes the procedure to run the R program to reproduce the results shown in the paper. These are examples of the use of the method and can easily be modified to work on other datasets on the user's computer. <br /><br /> The zip file (<span class="unix">GSEA.Examples.zip</span>) contains all the data, R scripts and results of the examples described in the paper.  <br /><br /> To run the R program, first expand <span class="unix">GSEA.Examples.zip</span> in a directory of your choice on your machine (let's call it <span class="unix">&ldquo;my-directory&rdquo;</span>). It should create the following subdirectory structure: <br /><br />
+
[http://www.broadinstitute.org/gsea/ GSEA Home] |
<table width="100%" cellspacing="4" cellpadding="4">
+
[http://www.broadinstitute.org/gsea/downloads.jsp Downloads] |
    <tbody>
+
[http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] |
        <tr>
+
[http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |
            <td class="unix">&lt;my-directory&gt;/GSEA/Examples/*</td>
+
[http://www.broadinstitute.org/gsea/contact.jsp Contact]
            <td>Has example scripts for all the examples in the paper and dataset subdirectories for each example. It also has the complete results for the ALLAML_S1 example.</td>
+
<br>
        </tr>
+
<p>The GSEA program is provided as a standalone R program, which is available on the [http://software.broadinstitute.org/gsea/downloads_archive.jsp Archived Downloads] page. Note that the R program was last updated in 2005 and may not work as-is with modern R releases. It is made available for reference purposes only and is no longer maintained or supported.</p>
        <tr>
+
 
            <td class="unix">&lt;my-directory&gt;/GSEA/AnnotationFiles/*</td>
+
<p>A readme file included with the R program contains instructions on how to run the program. The readme file is reproduced below for your convenience.</p>
            <td>Gene annotation files.</td>
+
 
        </tr>
+
<strong>Note</strong>: The GSEA-P-R program has the following limitations:<br>
        <tr>
 
            <td class="unix">&lt;my-directory&gt;/GSEA/GeneSetDatabases/*</td>
 
            <td>Gene set databases and annotation files. The gene annotation files are Affymetrix's.</td>
 
        </tr>
 
        <tr>
 
            <td class="unix">&lt;my-directory&gt;/GSEA/method/GSEA.R</td>
 
            <td>GSEA R program.</td>
 
        </tr>
 
        <tr>
 
            <td class="unix">&lt;my-directory&gt;/GSEA/method/Documentation</td>
 
            <td>Documentation for individual GSEA R functions (written in the style used for an R package)</td>
 
        </tr>
 
        <tr>
 
            <td class="unix">&lt;my-directory&gt;/GSEA/README.txt</td>
 
            <td>The README file (containing this information)</td>
 
        </tr>
 
    </tbody>
 
</table>
 
<br /><br /> Then, to run a specific example, do the following (these steps are for the Leukemia S1 example): <br /><br /> Use a text editor to open <span class="unix">&lt;my-directory&gt;/GSEA/Examples/Run.ALLAML_S1.R</span>.<br /> In that file, change the file pathnames (there are 7 of them) to point to the right directories on your computer. For example, if you want to run using the same directory structure as the examples, replace all instances of <span class="unix">&quot;d:/CGP2004&quot;</span> with <span class="unix">&quot;&lt;my-directory&gt;&quot;</span>. <br /><br /> That is all you need to make the script ready to run on your machine. You can also change the &quot;nperm&quot; parameter to 20 instead of 1000 to make a quick run and test that everything works before making the longer actual run of the example. After changing the pathnames, you can either cut and paste the script's contents into an R console or &quot;source&quot; the file from your R command line. The R script &ldquo;sources&rdquo; the GSEA program from <span class="unix">&lt;my-directory&gt;/GSEA/method/GSEA.R</span> and then calls the GSEA program with all the files and parameter settings. The GSEA program will run and produce (overwrite) the files in directory <span class="unix">&lt;my-directory&gt;/GSEA/Examples/ALLAML_S1</span>. You can move all the result files under <span class="unix">ALLAML_S1</span> to another place, or delete them, before running the script (but leave the input datasets <span class="unix">allaml.dataset.gct</span> and <span class="unix">allaml.phenotype.cls</span>). If you didn't save the result files and want to check whether your run produced the same results, you can always go back to the original files in the zip file. When you run those scripts with the same parameters (remember to set nperm = 1000 if you changed it for a quick run), you should obtain results identical to the original results files (<span class="unix">*.report.txt, global.plots.jpeg,</span> etc.).
<br /><br /> In the same way you can reproduce the results of the other examples. Once you manage to do this, you can try the GSEA method on your own data by replacing the input datasets and, if needed, the gene set databases. The gene set databases have been updated frequently, so we have saved the particular version that was used in the original run of each example. If you want generic versions of the gene set databases, use the ones with the explicit chip type in the name, such as <span class="unix">s1.hgu95av2.gmt</span>. <br /><br />
 
<h2>Description of GSEA output</h2>
 
The results of the GSEA are stored in the <span class="unix">&quot;output.directory&quot;</span> specified by the user as part of the input parameters to the GSEA R program. The results files are: <br /><br />
 
 
<ul>
 
<ul>
    <li>Two tab-separated global results text files (one for each phenotype). These files are labeled according to the doc string prefix and the phenotype name from the CLS (class) file: <span class="unix">&lt;doc.string&gt;.results.report.&lt;phenotype&gt;.txt</span></li>
+
<li>requires exactly two phenotype classes</li>
    <li>One set of global plots. They include a) gene list correlation profile, b) global observed and null densities, c) heat map for the entire sorted dataset, and d) p-values vs. NES plot. These plots are in a single JPEG file named <span class="unix">&lt;doc.string&gt;.global.plots.&lt;phenotype&gt;.jpg</span>. When the program is run interactively these plots appear on a window in the R GUI.</li>
+
<li>does not collapse dataset to gene symbols</li>
    <li>A variable number of tab-separated gene set results files according to how many sets pass any of the significance thresholds (<span class="unix">&quot;nom.p.val.threshold,&quot; &quot;fwer.p.val.threshold,&quot;</span> and <span class="unix">&quot;fdr.q.val.threshold&quot;</span>) and how many are specified in the &quot;topgs&quot; parameter. These files are named: <span class="unix">&lt;doc.string&gt;.&lt;gene set name&gt;.report.txt</span>. </li>
+
<li>does not perform permutations by gene_set</li>
    <li>A variable number of gene set plots (one for each gene set report file). These plots include a) gene set running enrichment &quot;mountain&quot; plot, b) gene set null distribution and c) heat map for genes in the gene set. These plots are stored in a single JPEG file named <span class="unix">&lt;doc.string&gt;.&lt;gene set name&gt;.jpg</span>.</li>
 
 
</ul>
 
</ul>
<br /> The format (columns) for the global result files is as follows. <br /><br />
+
<p>These are the instructions to run the R version of the GSEA program (GSEA-P-R.ZIP). There is a more user-friendly version of GSEA-P written in Java, the GSEA desktop application. If you want to run GSEA and you are not a programmer or a computational biologist, that version may be a better choice. The R version is intended for computationally experienced biologists, bioinformaticians, or computational biologists who are familiar with the GSEA algorithm and want to use the R implementation to explore the GSEA method further.</p>
<table width="100%" cellspacing="4" cellpadding="4">
+
<p>The GSEA-P-R program described here reflects the version of the methodology described and used in the Subramanian and Tamayo et al 2005 paper. For details about the method and the content of the output please see Supporting Information for that paper.</p>
    <tbody>
+
<p>You need to install R release 2.0 or later.</p>
        <tr>
+
<p>
            <td align="right" class="unix"><strong>GS :</strong></td>
+
- Copy the GSEA-P-R.ZIP file to your computer. <br>
            <td>Gene set name.</td>
+
- Unzip the file GSEA-P-R.ZIP using the option to create subdirectories.<br>
        </tr>
+
&nbsp; This should create the following files and subdirectories:<br>
        <tr>
+
</p>
            <td align="right" class="unix"><strong>SIZE :</strong></td>
+
GSEA program and functions in R (all the GSEA code is contained there):<br>
            <td>Number of genes in the set.</td>
+
GSEA/GSEA-P-R/GSEA.1.0.R<br>
        </tr>
+
<br>
        <tr>
+
Directory with input datasets, gct and cls files:<br>
            <td align="right" class="unix"><strong>SOURCE :</strong></td>
+
GSEA/GSEA-P-R/Datasets/<br>
            <td>Set definition or source.</td>
+
Gender.gct<br>
        </tr>
+
Gender.cls<br>
        <tr>
+
Leukemia.gct<br>
            <td align="right" class="unix"><strong>ES :</strong></td>
+
Leukemia.cls<br>
            <td>Enrichment score.</td>
+
Lung_Boston.gct<br>
        </tr>
+
Lung_Boston.cls<br>
        <tr>
+
Lung_Michigan.gct<br>
            <td align="right" class="unix"><strong>NES :</strong></td>
+
Lung_Michigan.cls<br>
            <td>Normalized enrichment score (multiplicative rescaling).</td>
+
Lung_Stanford.gct<br>
        </tr>
+
Lung_Stanford.cls<br>
        <tr>
+
Lung_Bost_maxed_common_Mich_Bost.gct<br>
            <td align="right" class="unix"><strong>NOM p-val :</strong></td>
+
Lung_Mich_maxed_common_Mich_Bost.gct<br>
            <td>Nominal p-value (from the null distribution of the gene set).</td>
+
P53.gct<br>
        </tr>
+
P53.cls<br>
        <tr>
+
<br>
            <td align="right" class="unix"><strong>FDR q-val :</strong></td>
+
Directory with gene set databases, gmt files:<br>
            <td>False discovery rate q-values.</td>
+
GSEA/GSEA-P-R/GeneSetDatabases/<br>
        </tr>
+
C1.gmt<br>
        <tr>
+
C2.gmt<br>
            <td align="right" class="unix"><strong>FWER p-val :</strong></td>
+
C3.gmt<br>
            <td>Family wise error rate p-values.</td>
+
C4.gmt<br>
        </tr>
+
Lung_Boston_poor_outcome.gmt<br>
        <tr>
+
Lung_Michigan_poor_outcome.gmt<br>
            <td align="right" class="unix"><strong>Tag %:</strong></td>
+
<br>
            <td>Percent of gene set before running enrichment peak.</td>
+
Directories with results of running the examples described in the paper:<br>
        </tr>
+
<br>
        <tr>
+
GSEA/GSEA-P-R/Gender_C1/<br>
            <td align="right" class="unix"><strong>Gene %:</strong></td>
+
Gender_C2<br>
            <td>Percent of gene list before running enrichment peak.</td>
+
Leukemia_C1<br>
        </tr>
+
Lung_Boston_C2<br>
        <tr>
+
Lung_Stanford_C2 <br>
            <td align="right" class="unix"><strong>Signal :</strong></td>
+
Lung_Michigan_C2<br>
            <td>Enrichment signal strength.</td>
+
Lung_Boston_outcome <br>
        </tr>
+
Lung_Michigan_outcome<br>
        <tr>
+
P53_C2<br>
            <td align="right" class="unix"><strong>FDR(median):</strong></td>
+
<br>
            <td>FDR q-values from the median of the null distributions.</td>
+
The top 20 high scoring gene sets are reported in table S2 (Supporting Information).<br>
        </tr>
+
<br>
        <tr>
+
One page R scripts to run the examples described in the paper:<br>
            <td align="right" class="unix"><strong>glob.p.val :</strong></td>
+
<br>
            <td>P-value using a global statistic (number of sets above the given set's NES).</td>
+
GSEA/GSEA-P-R/<br>
        </tr>
+
Run.Gender_C1.R<br>
    </tbody>
+
Run.Gender_C2.R<br>
</table>
+
Run.Leukemia_C1.R<br>
<br /> The rows are sorted by the NES values (from maximum positive or negative NES to minimum) <br /><br /> The format (columns) for the individual gene set result files is as follows. <br /><br />
+
Run.Lung_Boston_C2.R<br>
<table width="100%" cellspacing="4" cellpadding="4">
+
Run.Lung_Stanford_C2.R<br>
    <tbody>
+
Run.Lung_Michigan_C2.R<br>
        <tr>
+
Run.Lung_Boston_outcome.R<br>
            <td align="right" class="unix"><strong># :</strong></td>
+
Run.Lung_Michigan_outcome.R<br>
            <td>Gene number in the (sorted) gene set.</td>
+
Run.P53_C2.R<br>
        </tr>
+
<br>
        <tr>
+
To run, for example, the Leukemia dataset with the C1 gene set database, open the file GSEA/GSEA-P-R/Run.Leukemia_C1.R and change the file pathnames to reflect the location of the GSEA directory on your machine. For example, if you expanded the ZIP file under the directory &quot;C:/my_directory&quot;, you need to change the line: <br>
            <td align="right" class="unix"><strong>PROBE_ID :</strong></td>
+
<br>
            <td>The gene name or accession number in the dataset.</td>
+
<tt>GSEA.program.location &lt;- &quot;d:/CGP2005/GSEA/GSEA-P-R/GSEA.1.0.R&quot;</tt><br >
        </tr>
+
To:<br><br>
        <tr>
+
<tt>GSEA.program.location &lt;- &quot;C:/my_directory/GSEA/GSEA-P-R/GSEA.1.0.R&quot;</tt><br>
            <td align="right" class="unix"><strong>SYMBOL :</strong></td>
+
And the same change to each pathname in that file: you need to replace
            <td>The gene symbol from the gene annotation file.</td>
+
<tt>&quot;d:/CGP2005&quot;</tt> with <tt>&quot;C:/my_directory&quot;</tt>.<br><br>
        </tr>
+
You may also want to change the line:<br><br>
        <tr>
+
<tt>doc.string = &quot;Leukemia_C1&quot;,</tt><br>
            <td align="right" class="unix"><strong>DESC :</strong></td>
+
<br>To:<br>
            <td>The gene description (title) from the gene annotation file.</td>
+
<tt>doc.string = &quot;my_run_of_Leukemia_C1&quot;,</tt><br>
        </tr>
+
<br>or any other prefix label you want to give your results. This way you won't overwrite the original results that come in those directories, and you can compare them with the results of your own run. <br>
        <tr>
+
<p>
            <td align="right" class="unix"><strong>LIST LOC :</strong></td>
+
After the pathnames have been changed to reflect the location of the directories on your machine, run the GSEA program by opening the R GUI and pasting the contents of the <br>
            <td>The location of the gene in the sorted gene list.</td>
+
<tt>Run.&lt;example&gt;.R</tt><br>
        </tr>
+
files on it.<br>
        <tr>
+
For example, to run the Leukemia vs. C1 example, use the contents of the file <tt>&quot;Run.Leukemia_C1.R&quot;</tt>. The program is self-contained and should run and produce the results under the directory <tt>&quot;C:/my_directory/GSEA/GSEA-P-R/Leukemia_C1&quot;</tt>. These files are set up with the parameters used in the examples of the paper (e.g. to produce detailed results for the significant and top 20 gene sets). You may want to start with these parameters and change them only when needed, once you have more experience with the program. For details on the effects of changing some of the parameters, see the Supporting Information document.</p>
            <td align="right" class="unix"><strong>S2N :</strong></td>
+
If you want to run a completely new dataset the easiest way is:<br>
            <td>The signal to noise ratio (correlation) of the gene in the gene list.</td>
+
<ol>
        </tr>
+
<li> Create a new directory: e.g. GSEA/GSEA-P-R/my_dataset, where you can store the inputs and outputs of running GSEA on those files. </li>
        <tr>
+
<li>Manually convert your files to *.gct (expression dataset) and *.cls (phenotype labels).</li>
            <td align="right" class="unix"><strong>RES :</strong></td>
+
<li>Use Run.Leukemia_C1.R as a template to make a new script to run your data.</li>
            <td>The value of the running enrichment score at the gene location.</td>
+
<li>Change the relevant pathnames to point to your input files in directory my_dataset. Change the doc.string to an appropriate prefix name for your files.</li>
        </tr>
+
<li>Copy and paste the contents of this new script file into the R GUI to run it. The results will be stored in my_dataset.</li>
        <tr>
+
</ol>
            <td align="right" class="unix"><strong>CORE_ENRICHMENT:</strong></td>
+
The GSEA-P-R program reads input files in *.gct, *.cls and *.gmt formats. As you can see from the example files, these are simple tab-separated ASCII files. If your datasets are not in this format, you can use a text editor to convert them. If you start with a tab-separated ASCII file, the conversion typically consists of modifying the header lines at the top of the file. Please note that GSEA-P-R requires that the *.cls file has two and only two phenotype classes.<br />
            <td>Is this gene in the &quot;core enrichment&quot; section of the list? Yes or No variable specifying if the gene location is before (positive ES) or after (negative ES) the running enrichment peak.</td>
+
<br />
        </tr>
+
If you have questions or problems running or using the program, please send them to gsea@broadinstitute.org. Also let us know if you find GSEA a useful tool in your work.
    </tbody>
 
</table>
 
<br /> The rows are sorted by the gene location in the gene list. <br /><br /> The function call to GSEA returns a two element list containing the two global result reports as data frames ($report1, $report2).
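
As an illustration of working with one of the global result reports, the following R sketch filters a report by FDR q-value and orders the surviving gene sets by absolute NES. It runs on a small made-up data frame rather than a real report. The column names follow the table above and are an assumption about the exact headers in the files; the 0.25 cutoff is an arbitrary value chosen for the example.

```r
# Toy stand-in for one global report (made-up values, not real GSEA output).
# Column names mirror the column table above; the exact headers in a real
# report file are an assumption.
report <- data.frame(
  GS = c("SET_A", "SET_B", "SET_C"),
  SIZE = c(25, 40, 18),
  NES = c(2.1, -1.7, 0.9),
  `FDR q-val` = c(0.01, 0.20, 0.65),
  check.names = FALSE,        # keep "FDR q-val" as written
  stringsAsFactors = FALSE    # keep gene set names as character strings
)

# Keep gene sets at or below an illustrative FDR cutoff
fdr_cutoff <- 0.25
significant <- report[report[["FDR q-val"]] <= fdr_cutoff, ]

# Order by the magnitude of NES, largest first
significant <- significant[order(-abs(significant$NES)), ]
print(significant)
```

A real tab-separated report could be loaded the same way with `read.delim(path, check.names = FALSE)`, provided its headers match the names used here.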
 

Latest revision as of 16:29, 28 August 2019


The GSEA program is provided as a standalone R program, which is available on the Archived Downloads page. Note that the R program was last updated in 2005 and may not work as-is with modern R releases. It is made available for reference purposes only and is no longer maintained or supported.

A readme file included with the R program contains instructions on how to run the program. The readme file is reproduced below for your convenience.

Note: The GSEA-P-R program has the following limitations:

  • requires exactly two phenotype classes
  • does not collapse dataset to gene symbols
  • does not perform permutations by gene_set

These are the instructions to run the R version of the GSEA program (GSEA-P-R.ZIP). There is a more user-friendly version of GSEA-P written in Java, the GSEA desktop application. If you want to run GSEA and you are not a programmer or a computational biologist, that version may be a better choice. The R version is intended for computationally experienced biologists, bioinformaticians, or computational biologists who are familiar with the GSEA algorithm and want to use the R implementation to explore the GSEA method further.

The GSEA-P-R program described here reflects the version of the methodology described and used in the Subramanian, Tamayo et al. (2005) paper. For details about the method and the content of the output, please see the Supporting Information for that paper.

You need to install R release 2.0 or later.

- Copy the GSEA-P-R.ZIP file to your computer.
- Unzip the file GSEA-P-R.ZIP using the option to create subdirectories.
  This should create the following files and subdirectories:

GSEA program and functions in R (all the GSEA code is contained there):
GSEA/GSEA-P-R/GSEA.1.0.R

Directory with input datasets, gct and cls files:
GSEA/GSEA-P-R/Datasets/
Gender.gct
Gender.cls
Leukemia.gct
Leukemia.cls
Lung_Boston.gct
Lung_Boston.cls
Lung_Michigan.gct
Lung_Michigan.cls
Lung_Stanford.gct
Lung_Stanford.cls
Lung_Bost_maxed_common_Mich_Bost.gct
Lung_Mich_maxed_common_Mich_Bost.gct
P53.gct
P53.cls

Directory with gene set databases, gmt files:
GSEA/GSEA-P-R/GeneSetDatabases/
C1.gmt
C2.gmt
C3.gmt
C4.gmt
Lung_Boston_poor_outcome.gmt
Lung_Michigan_poor_outcome.gmt

Directories with results of running the examples described in the paper:

GSEA/GSEA-P-R/Gender_C1/
Gender_C2
Leukemia_C1
Lung_Boston_C2
Lung_Stanford_C2
Lung_Michigan_C2
Lung_Boston_outcome
Lung_Michigan_outcome
P53_C2

The top 20 high scoring gene sets are reported in table S2 (Supporting Information).

One page R scripts to run the examples described in the paper:

GSEA/GSEA-P-R/
Run.Gender_C1.R
Run.Gender_C2.R
Run.Leukemia_C1.R
Run.Lung_Boston_C2.R
Run.Lung_Stanford_C2.R
Run.Lung_Michigan_C2.R
Run.Lung_Boston_outcome.R
Run.Lung_Michigan_outcome.R
Run.P53_C2.R

To run, for example, the Leukemia dataset with the C1 gene set database, open the file GSEA/GSEA-P-R/Run.Leukemia_C1.R and change the file pathnames to reflect the location of the GSEA directory on your machine. For example, if you expanded the ZIP file under the directory "C:/my_directory", you need to change the line:

GSEA.program.location <- "d:/CGP2005/GSEA/GSEA-P-R/GSEA.1.0.R"
To:

GSEA.program.location <- "C:/my_directory/GSEA/GSEA-P-R/GSEA.1.0.R"

Make the same change to every pathname in that file: replace "d:/CGP2005" with "C:/my_directory".
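
The pathname replacement can also be done from R instead of a text editor. The sketch below is one possible way, not part of the GSEA-P-R distribution: `retarget_script` is a hypothetical helper name, and the demonstration runs on a two-line temporary file standing in for a real run script.

```r
# Hypothetical helper: replace a hard-coded root directory in a run script.
retarget_script <- function(script_in, script_out, old_root, new_root) {
  lines <- readLines(script_in, warn = FALSE)
  # fixed = TRUE treats old_root as a literal string, not a regular expression
  updated <- gsub(old_root, new_root, lines, fixed = TRUE)
  writeLines(updated, script_out)
  invisible(sum(lines != updated))  # number of lines that changed
}

# Demonstration on a temporary stand-in for Run.Leukemia_C1.R
demo_in <- tempfile(fileext = ".R")
demo_out <- tempfile(fileext = ".R")
writeLines(c(
  'GSEA.program.location <- "d:/CGP2005/GSEA/GSEA-P-R/GSEA.1.0.R"',
  'doc.string = "Leukemia_C1",'
), demo_in)

n_changed <- retarget_script(demo_in, demo_out, "d:/CGP2005", "C:/my_directory")
print(readLines(demo_out))
```

Writing to a separate output file keeps the original script untouched, so the two versions can be compared before anything is run.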

You may also want to change the line:

doc.string = "Leukemia_C1",

To:
doc.string = "my_run_of_Leukemia_C1",

or any other prefix label you want to give your results. This way you won't overwrite the original results that come in those directories, and you can compare them with the results of your own run.

After the pathnames have been changed to reflect the location of the directories on your machine, run the GSEA program by opening the R GUI and pasting in the contents of the relevant script file:
Run.<example>.R

For example, to run the Leukemia vs. C1 example, use the contents of the file "Run.Leukemia_C1.R". The program is self-contained and should run and produce the results under the directory "C:/my_directory/GSEA/GSEA-P-R/Leukemia_C1". These files are set up with the parameters used in the examples of the paper (e.g. to produce detailed results for the significant and top 20 gene sets). You may want to start with these parameters and change them only when needed, once you have more experience with the program. For details on the effects of changing some of the parameters, see the Supporting Information document.

If you want to run a completely new dataset the easiest way is:

  1. Create a new directory: e.g. GSEA/GSEA-P-R/my_dataset, where you can store the inputs and outputs of running GSEA on those files.
  2. Manually convert your files to *.gct (expression dataset) and *.cls (phenotype labels).
  3. Use Run.Leukemia_C1.R as a template to make a new script to run your data.
  4. Change the relevant pathnames to point to your input files in directory my_dataset. Change the doc.string to an appropriate prefix name for your files.
  5. Copy and paste the contents of this new script file into the R GUI to run it. The results will be stored in my_dataset.
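
Steps 1, 3 and 4 above can be sketched in R as follows. This is only an outline under stated assumptions: the directory is created under a temporary location so the snippet runs anywhere, a one-line stand-in is written if the real template is absent, and the names `Run.my_dataset.R` and `my_dataset_C1` are hypothetical.

```r
# 'gsea_root' stands in for <your directory>/GSEA/GSEA-P-R (assumption)
gsea_root <- file.path(tempdir(), "GSEA", "GSEA-P-R")
dir.create(gsea_root, recursive = TRUE, showWarnings = FALSE)

# Stand-in for the real template so that the sketch is self-contained
template <- file.path(gsea_root, "Run.Leukemia_C1.R")
if (!file.exists(template)) {
  writeLines('doc.string = "Leukemia_C1",', template)
}

# Step 1: directory that will hold the inputs and outputs
my_dataset <- file.path(gsea_root, "my_dataset")
dir.create(my_dataset, showWarnings = FALSE)

# Step 3: copy the template under a new (hypothetical) script name
new_script <- file.path(gsea_root, "Run.my_dataset.R")
file.copy(template, new_script, overwrite = TRUE)

# Step 4: give the results a new prefix label
script <- readLines(new_script, warn = FALSE)
script <- sub('"Leukemia_C1"', '"my_dataset_C1"', script, fixed = TRUE)
writeLines(script, new_script)
```

The pathnames inside the copied script still have to be pointed at the new input files by hand or with a similar substitution.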

The GSEA-P-R program reads input files in *.gct, *.cls and *.gmt formats. As you can see from the example files, these are simple tab-separated ASCII files. If your datasets are not in this format, you can use a text editor to convert them. If you start with a tab-separated ASCII file, the conversion typically consists of modifying the header lines at the top of the file. Please note that GSEA-P-R requires that the *.cls file has two and only two phenotype classes.
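
For readers who prefer to generate the input files from R rather than edit them by hand, here is a minimal sketch that writes a tiny expression matrix as a *.gct file and its labels as a *.cls file. The layout used (a "#1.2" version line, a dimensions line and a NAME/Description header for GCT; a counts line, a class-name line and a label line for CLS) is the commonly documented form of these formats and is assumed here to match what GSEA-P-R expects; compare with the example files in Datasets/ before relying on it. `write_gct` and `write_cls` are hypothetical helper names, and the data values are made up.

```r
# Write a numeric matrix (genes in rows, samples in columns) as a GCT file.
write_gct <- function(expr, descriptions, path) {
  stopifnot(is.matrix(expr), !is.null(rownames(expr)), !is.null(colnames(expr)))
  con <- file(path, "w")
  on.exit(close(con))
  writeLines("#1.2", con)                                   # format version
  writeLines(paste(nrow(expr), ncol(expr), sep = "\t"), con) # genes, samples
  writeLines(paste(c("NAME", "Description", colnames(expr)), collapse = "\t"), con)
  body <- cbind(rownames(expr), descriptions, expr)
  write.table(body, con, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
}

# Write phenotype labels as a CLS file, enforcing the two-class requirement.
write_cls <- function(labels, path) {
  classes <- unique(labels)
  if (length(classes) != 2) {
    stop("GSEA-P-R requires exactly two phenotype classes")
  }
  writeLines(c(
    paste(length(labels), length(classes), 1),   # samples, classes, 1
    paste("#", paste(classes, collapse = " ")),  # class names
    paste(labels, collapse = " ")                # one label per sample
  ), path)
}

# Made-up example data: 2 genes x 4 samples
expr <- matrix(c(1.5, 2.0, 3.5, 4.0,
                 0.5, 1.0, 2.5, 3.0),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("GENE_A", "GENE_B"), c("S1", "S2", "S3", "S4")))
gct_path <- tempfile(fileext = ".gct")
cls_path <- tempfile(fileext = ".cls")
write_gct(expr, descriptions = c("na", "na"), path = gct_path)
write_cls(c("ALL", "ALL", "AML", "AML"), cls_path)

gct_lines <- readLines(gct_path)
cls_lines <- readLines(cls_path)
print(gct_lines)
print(cls_lines)
```

The check inside `write_cls` stops early when the labels contain more or fewer than two classes, which mirrors the limitation noted above.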

If you have questions or problems running or using the program, please send them to gsea@broadinstitute.org. Also let us know if you find GSEA a useful tool in your work.