Difference between revisions of "R-GSEA Readme"

From GeneSetEnrichmentAnalysisWiki
The GSEA program is provided as a standalone R program. This file describes how to run the R program to reproduce the results shown in the paper. These examples illustrate the use of the method and can easily be modified to work on other datasets on the user's computer. <br /><br /> The zip file (<span class="unix">GSEA.Examples.zip</span>) contains all the data, R scripts, and results of the examples described in the paper. <br /><br /> To run the R program, first expand <span class="unix">GSEA.Examples.zip</span> into a directory of your choice on your machine (let's call it <span class="unix">&ldquo;my-directory&rdquo;</span>). This should create the following subdirectory structure: <br /><br />
<table width="100%" cellspacing="4" cellpadding="4">
    <tbody>
        <tr>
            <td class="unix">&lt;my-directory&gt;/GSEA/Examples/*</td>
            <td>Has example scripts for all the examples in the paper and dataset subdirectories for each example. It also has the complete results for the ALLAML_S1 example.</td>
        </tr>
        <tr>
            <td class="unix">&lt;my-directory&gt;/GSEA/AnnotationFiles/*</td>
            <td>Gene annotation files.</td>
        </tr>
        <tr>
            <td class="unix">&lt;my-directory&gt;/GSEA/GeneSetDatabases/*</td>
            <td>Gene set databases and annotation files. The gene annotation files are Affymetrix's.</td>
        </tr>
        <tr>
            <td class="unix">&lt;my-directory&gt;/GSEA/method/GSEA.R</td>
            <td>GSEA R program.</td>
        </tr>
        <tr>
            <td class="unix">&lt;my-directory&gt;/GSEA/method/Documentation</td>
            <td>Documentation for individual GSEA R functions (in the style used for an R package).</td>
        </tr>
        <tr>
            <td class="unix">&lt;my-directory&gt;/GSEA/README.txt</td>
            <td>The README file (containing this information).</td>
        </tr>
    </tbody>
</table>
<br /><br /> To run a specific example, do the following (these steps are for the Leukemia S1 example): <br /><br /> Use a text editor to open <span class="unix">&lt;my-directory&gt;/GSEA/Examples/Run.ALLAML_S1.R</span>.<br /> In that file, change the file pathnames (there are 7 of them) to point to the right directories on your computer. For example, if you want to run using the same directory structure as the examples, replace all instances of <span class="unix">&quot;d:/CGP2004&quot;</span> with <span class="unix">&quot;&lt;my-directory&gt;&quot;</span>. <br /><br /> That is all you need to do to make the script ready to run on your machine. You can also change the &quot;nperm&quot; parameter to 20 instead of 1000 for a quick run that tests that everything is working before making the longer actual run of the example. After changing the pathnames, you can either cut and paste the script's contents into an R console or &quot;source&quot; the file from your R command line. The R script &ldquo;sources&rdquo; the GSEA program from <span class="unix">&lt;my-directory&gt;/GSEA/method/GSEA.R</span> and then calls the GSEA program with all the files and parameter settings. The GSEA program will run and produce (overwrite) the files in the directory <span class="unix">&lt;my-directory&gt;/GSEA/Examples/ALLAML_S1</span>. You can move or delete the result files under <span class="unix">ALLAML_S1</span> before running the script (but leave the input datasets <span class="unix">allaml.dataset.gct</span> and <span class="unix">allaml.phenotype.cls</span> in place). If you did not save the result files and want to check whether your run produced the same results, you can always go back to the original files in the zip file. When you run the scripts with the same parameters (remember to set nperm = 1000 if you changed it for a quick run), you should obtain results identical to the original result files (<span class="unix">*.report.txt, global.plots.jpeg,</span> etc.).
<br /><br /> You can reproduce the results of the other examples in the same way. Once you have done this, you can try the GSEA method on your own data by replacing the input datasets and, if needed, the gene set databases. The gene set databases have been updated frequently, so we have saved the particular version that was used in the original run of each example. If you want generic versions of the gene set databases, use the ones with the explicit chip type in the name, such as <span class="unix">s1.hgu95av2.gmt</span>. <br /><br />
<h2>Description of GSEA output</h2>
The results of the GSEA are stored in the <span class="unix">&quot;output.directory&quot;</span> specified by the user as part of the input parameters to the GSEA R program. The results files are: <br /><br />
<ul>
    <li>Two tab-separated global results text files (one for each phenotype). These files are labeled according to the doc string prefix and the phenotype name from the CLS (class) file: <span class="unix">&lt;doc.string&gt;.results.report.&lt;phenotype&gt;.txt</span></li>
    <li>One set of global plots. They include a) gene list correlation profile, b) global observed and null densities, c) heat map for the entire sorted dataset, and d) p-values vs. NES plot. These plots are in a single JPEG file named <span class="unix">&lt;doc.string&gt;.global.plots.&lt;phenotype&gt;.jpg</span>. When the program is run interactively, these plots appear in a window in the R GUI.</li>
    <li>A variable number of tab-separated gene set results files, according to how many sets pass any of the significance thresholds (<span class="unix">&quot;nom.p.val.threshold,&quot; &quot;fwer.p.val.threshold,&quot;</span> and <span class="unix">&quot;fdr.q.val.threshold&quot;</span>) and how many are specified in the &quot;topgs&quot; parameter. These files are named: <span class="unix">&lt;doc.string&gt;.&lt;gene set name&gt;.report.txt</span>. </li>
    <li>A variable number of gene set plots (one for each gene set report file). These plots include a) gene set running enrichment &quot;mountain&quot; plot, b) gene set null distribution and c) heat map for genes in the gene set. These plots are stored in a single JPEG file named <span class="unix">&lt;doc.string&gt;.&lt;gene set name&gt;.jpg</span>.</li>
</ul>
<br /> The format (columns) for the global result files is as follows. <br /><br />
<table width="100%" cellspacing="4" cellpadding="4">
    <tbody>
        <tr>
            <td align="right" class="unix"><strong>GS :</strong></td>
            <td>Gene set name.</td>
        </tr>
        <tr>
            <td align="right" class="unix"><strong>SIZE :</strong></td>
            <td>Number of genes in the set.</td>
        </tr>
        <tr>
            <td align="right" class="unix"><strong>SOURCE :</strong></td>
            <td>Set definition or source.</td>
        </tr>
        <tr>
            <td align="right" class="unix"><strong>ES :</strong></td>
            <td>Enrichment score.</td>
        </tr>
        <tr>
            <td align="right" class="unix"><strong>NES :</strong></td>
            <td>Normalized enrichment score (multiplicative rescaling).</td>
        </tr>
        <tr>
            <td align="right" class="unix"><strong>NOM p-val :</strong></td>
            <td>Nominal p-value (from the null distribution of the gene set).</td>
        </tr>
        <tr>
            <td align="right" class="unix"><strong>FDR q-val :</strong></td>
            <td>False discovery rate q-values.</td>
        </tr>
        <tr>
            <td align="right" class="unix"><strong>FWER p-val :</strong></td>
            <td>Family-wise error rate p-values.</td>
        </tr>
        <tr>
            <td align="right" class="unix"><strong>Tag %:</strong></td>
            <td>Percent of gene set before running enrichment peak.</td>
        </tr>
        <tr>
            <td align="right" class="unix"><strong>Gene %:</strong></td>
            <td>Percent of gene list before running enrichment peak.</td>
        </tr>
        <tr>
            <td align="right" class="unix"><strong>Signal :</strong></td>
            <td>Enrichment signal strength.</td>
        </tr>
        <tr>
            <td align="right" class="unix"><strong>FDR(median):</strong></td>
            <td>FDR q-values from the median of the null distributions.</td>
        </tr>
        <tr>
            <td align="right" class="unix"><strong>glob.p.val :</strong></td>
            <td>P-value using a global statistic (number of sets above the given set's NES).</td>
        </tr>
    </tbody>
</table>
<br /> The rows are sorted by the NES values (from maximum positive or negative NES to minimum) <br /><br /> The format (columns) for the individual gene set result files is as follows. <br /><br />
<table width="100%" cellspacing="4" cellpadding="4">
    <tbody>
        <tr>
            <td align="right" class="unix"><strong># :</strong></td>
            <td>Gene number in the (sorted) gene set.</td>
        </tr>
        <tr>
 
            <td align="right" class="unix"><strong>PROBE_ID :</strong></td>
 
            <td>The gene name or accession number in the dataset.</td>
 
        </tr>
 
        <tr>
 
            <td align="right" class="unix"><strong>SYMBOL :</strong></td>
 
            <td>The gene symbol from the gene annotation file.</td>
 
        </tr>
 
        <tr>
 
            <td align="right" class="unix"><strong>DESC :</strong></td>
 
            <td>The gene description (title) from the gene annotation file.</td>
 
        </tr>
 
        <tr>
 
            <td align="right" class="unix"><strong>LIST LOC :</strong></td>
 
            <td>The location of the gene in the sorted gene list.</td>
 
        </tr>
 
        <tr>
 
            <td align="right" class="unix"><strong>S2N :</strong></td>
 
            <td>The signal to noise ratio (correlation) of the gene in the gene list.</td>
 
        </tr>
 
        <tr>
 
            <td align="right" class="unix"><strong>RES :</strong></td>
 
            <td>The value of the running enrichment score at the gene location.</td>
 
        </tr>
 
        <tr>
 
            <td align="right" class="unix"><strong>CORE_ENRICHMENT:</strong></td>
 
            <td>Is this gene in the &quot;core enrichment&quot; section of the list? Yes or No variable specifying if the gene location is before (positive ES) or after (negative ES) the running enrichment peak.</td>
 
        </tr>
 
    </tbody>
 
</table>
 
<br /> The rows are sorted by the gene location in the gene list. <br /><br /> The function call to GSEA returns a two element list containing the two global result reports as data frames ($report1, $report2).
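The global report files described above are plain tab-separated text, so they can be loaded back into R for further filtering. The following is a minimal sketch, not part of the GSEA program: the helper names `read_gsea_report` and `significant_sets` are hypothetical, the column name "FDR q-val" is taken from the column list above and should be checked against the header of your own report files, and the 0.25 cutoff is only an illustrative value.

```r
# Hypothetical helpers (not part of GSEA): load a global report file and
# keep the gene sets whose FDR q-value is at or below a chosen cutoff.

read_gsea_report <- function(path) {
  # check.names = FALSE keeps headers such as "FDR q-val" unchanged
  read.delim(path, header = TRUE, check.names = FALSE,
             stringsAsFactors = FALSE)
}

significant_sets <- function(report, fdr_col = "FDR q-val", threshold = 0.25) {
  # Assumption: the report has a column named exactly as fdr_col
  q <- report[[fdr_col]]
  report[!is.na(q) & q <= threshold, , drop = FALSE]
}

# Example use (the file name is a placeholder built from the naming
# pattern <doc.string>.results.report.<phenotype>.txt):
# rep1 <- read_gsea_report("my_run.results.report.ALL.txt")
# significant_sets(rep1)
```

The same filtering can be applied directly to the data frames returned by the GSEA call ($report1, $report2), provided their column names match the ones in the files.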
 

Revision as of 13:58, 10 January 2007

The GSEA program is provided as a standalone R program, which is available on the Software page (http://www.broad.mit.edu/gsea/software/software_index.html).

These are the instructions for running the R version of the GSEA program (GSEA-P-R.ZIP). Note that there is a more user-friendly version of GSEA-P written in Java, the GSEA desktop application. If you want to run GSEA and you are not a programmer or a computational biologist, that version may be a better choice. The R version is intended for computationally experienced biologists, bioinformaticians, and computational biologists.

The GSEA-P-R program described here reflects the version of the methodology described and used in the Subramanian, Tamayo et al. 2005 paper. For details about the method and the content of the output, please see the Supporting Information for that paper.

You need to install R release 2.0 or later.

- Copy the GSEA-P-R.ZIP file to your computer.
- Unzip the file GSEA-P-R.ZIP using the option to create subdirectories.
  This should create the following files and subdirectories:

GSEA program and functions in R (all the GSEA code is contained there):

GSEA/GSEA-P-R/GSEA.1.0.R        

Directory with input datasets, gct and cls files:
  GSEA/GSEA-P-R/Datasets/        
                         Gender.gct
                         Gender.cls
                         Leukemia.gct
                         Leukemia.cls
                         Lung_Boston.gct
                         Lung_Boston.cls
                         Lung_Michigan.gct
                         Lung_Michigan.cls
                         Lung_Stanford.gct
                         Lung_Stanford.cls
                         Lung_Bost_maxed_common_Mich_Bost.gct
                         Lung_Mich_maxed_common_Mich_Bost.gct
                         P53.gct
                         P53.cls

Directory with gene set databases, gmt files:
  GSEA/GSEA-P-R/GeneSetDatabases/
                                 C1.gmt
                                 C2.gmt
                                 C3.gmt
                                 C4.gmt
                                 Lung_Boston_poor_outcome.gmt
                                 Lung_Michigan_poor_outcome.gmt

Directories with results of running the examples described in the paper:

  GSEA/GSEA-P-R/Gender_C1/
                          Gender_C2
                          Leukemia_C1
                          Lung_Boston_C2
                          Lung_Stanford_C2
                          Lung_Michigan_C2
                          Lung_Boston_outcome
                          Lung_Michigan_outcome
                          P53_C2

The top 20 high scoring gene sets are reported in table S2 (Supporting Information).

One page R scripts to run the examples described in the paper:

  GSEA/GSEA-P-R/
                Run.Gender_C1.R
                Run.Gender_C2.R
                Run.Leukemia_C1.R
                Run.Lung_Boston_C2.R
                Run.Lung_Stanford_C2.R
                Run.Lung_Michigan_C2.R
                Run.Lung_Boston_outcome.R
                Run.Lung_Michigan_outcome.R
                Run.P53_C2.R

To run, for example, the Leukemia dataset with the C1 gene set database, open the file GSEA/GSEA-P-R/Run.Leukemia_C1.R and change the file pathnames to reflect the location of the GSEA directory on your machine. For example, if you expanded the ZIP file under your directory "C:/my_directory", you need to change the line:

GSEA.program.location <- "d:/CGP2005/GSEA/GSEA-P-R/GSEA.1.0.R"  

To:

GSEA.program.location <- "C:/my_directory/GSEA/GSEA-P-R/GSEA.1.0.R"

 Make the same change to every pathname in that file: replace "d:/CGP2005" with "C:/my_directory".
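If you prefer not to edit every pathname by hand, the replacement can be scripted. The sketch below is not part of GSEA-P-R: `rewrite_root` is a hypothetical helper, and it assumes the old root appears in the script literally as "d:/CGP2005" (same case and forward slashes). It writes the patched script to a new file so the original stays untouched.

```r
# Hypothetical helper (not part of GSEA-P-R): replace the hard-coded root
# directory in a Run.<example>.R script and save the result as a new file.
rewrite_root <- function(script_in, script_out,
                         old_root = "d:/CGP2005",
                         new_root = "C:/my_directory") {
  lines <- readLines(script_in, warn = FALSE)
  # fixed = TRUE treats old_root as a literal string, not a regular expression
  patched <- gsub(old_root, new_root, lines, fixed = TRUE)
  writeLines(patched, script_out)
  # Return (invisibly) how many lines changed, as a quick sanity check
  invisible(sum(patched != lines))
}

# Example use (both paths are placeholders for your own installation):
# rewrite_root("C:/my_directory/GSEA/GSEA-P-R/Run.Leukemia_C1.R",
#              "C:/my_directory/GSEA/GSEA-P-R/Run.my_Leukemia_C1.R")
```

Checking the returned count against the number of pathnames you expect in the script is a simple way to confirm that nothing was missed.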

 You may also want to change the line:

doc.string            = "Leukemia_C1",

To:

doc.string            = "my_run_of_Leukemia_C1",

or any other prefix label you want to give your results. This way you won't overwrite the original results that come in those directories, and you can use them for comparison with the results of your own run.

After the pathnames have been changed to reflect the location of the directories on your machine, open the R GUI and paste the contents of the Run.<example>.R file into it to run the GSEA program. For example, to run the Leukemia vs. C1 example, use the contents of the file "Run.Leukemia_C1.R". The program is self-contained and should run and produce the results under the directory "C:/my_directory/GSEA/GSEA-P-R/Leukemia_C1". These files are set up with the parameters used in the examples of the paper (e.g. to produce detailed results for the significant and top 20 gene sets). You may want to start with these parameters and change them only when needed, as you gain more experience with the program. For details on the effect of changing some of the parameters, see the Supporting Information document.
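As an alternative to pasting, a script can be run with R's built-in source() function, which evaluates the file much as if its contents had been typed at the console. The wrapper below is only a sketch: `run_example` is a hypothetical name, and the path in the usage comment is a placeholder for your own installation.

```r
# Hypothetical wrapper: check that the run script exists, then evaluate it
# with source() instead of pasting its contents into the R console.
run_example <- function(script) {
  if (!file.exists(script)) {
    stop("Run script not found: ", script)
  }
  source(script, echo = FALSE)
  invisible(TRUE)
}

# Example use (placeholder path):
# run_example("C:/my_directory/GSEA/GSEA-P-R/Run.Leukemia_C1.R")
```

The existence check gives a clear error message when a pathname is mistyped, before any of the longer GSEA computation starts.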

If you want to run a completely new dataset the easiest way is:

i) Create a new directory, e.g. GSEA/GSEA-P-R/my_dataset, where you can store the inputs and outputs of running GSEA on those files.
ii) Manually convert your files to *.gct (expression dataset) and *.cls (phenotype labels).
iii) Use Run.Leukemia_C1.R as a template to make a new script to run your data.
iv) Change the relevant pathnames to point to your input files in the directory my_dataset. Change the doc.string to an appropriate prefix name for your files.
v) Cut and paste the contents of this new script file into the R GUI to run it. The results will be stored in the output directory set in the script (e.g. my_dataset).
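Steps i) and iii) above can be sketched in a few lines of R. This is an illustration only: `setup_new_run` is a hypothetical helper that is not shipped with GSEA-P-R, and the name it gives the new script (Run.<dataset_name>.R) is an arbitrary choice. The pathnames and doc.string inside the copied script still have to be edited as described in step iv).

```r
# Hypothetical helper (not part of GSEA-P-R): create a directory for a new
# dataset and copy a template run script under a new name.
setup_new_run <- function(gsea_dir, dataset_name = "my_dataset",
                          template = "Run.Leukemia_C1.R") {
  # Step i: directory that will hold the inputs and outputs
  dataset_dir <- file.path(gsea_dir, dataset_name)
  dir.create(dataset_dir, recursive = TRUE, showWarnings = FALSE)

  # Step iii: copy the template; overwrite = FALSE protects an existing script
  new_script <- file.path(gsea_dir,
                          paste("Run.", dataset_name, ".R", sep = ""))
  copied <- file.copy(file.path(gsea_dir, template), new_script,
                      overwrite = FALSE)
  if (!copied) {
    warning("Template not copied (template missing or target exists): ",
            new_script)
  }
  invisible(new_script)
}

# Example use (placeholder path):
# setup_new_run("C:/my_directory/GSEA/GSEA-P-R")
```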

The GSEA-P-R program reads input files in *.gct, *.cls and *.gmt formats. As you can see from the example files, these are simple tab-separated ASCII files. If your datasets are not in this format, you can use a text editor to convert them. If you start with a tab-separated ASCII file, the conversion typically consists of modifying the header lines at the top of the file.
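If your data are already in R, the header lines can also be written programmatically. The sketch below assumes the commonly used GCT 1.2 layout (a "#1.2" line, a line with the row and column counts, then a header starting with Name and Description) and the three-line CLS layout (counts, class names, one label per sample). The helper names `write_gct` and `write_cls`, the "na" description filler, and the space-delimited CLS lines are assumptions; compare the output with the bundled example files, such as Leukemia.gct and Leukemia.cls, before running GSEA on it.

```r
# Hypothetical helpers (not part of GSEA-P-R) that write a numeric matrix
# and a vector of phenotype labels in GCT- and CLS-style layouts.

write_gct <- function(expr, path, descriptions = rep("na", nrow(expr))) {
  stopifnot(is.matrix(expr), !is.null(rownames(expr)), !is.null(colnames(expr)))
  con <- file(path, open = "w")
  on.exit(close(con))
  writeLines("#1.2", con)                                      # format version
  writeLines(paste(nrow(expr), ncol(expr), sep = "\t"), con)   # genes, samples
  writeLines(paste(c("Name", "Description", colnames(expr)),
                   collapse = "\t"), con)                      # column header
  body <- cbind(rownames(expr), descriptions, expr)            # one row per gene
  write.table(body, con, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
}

write_cls <- function(labels, path) {
  classes <- unique(labels)            # class names in order of appearance
  con <- file(path, open = "w")
  on.exit(close(con))
  writeLines(paste(length(labels), length(classes), 1), con)   # counts line
  writeLines(paste(c("#", classes), collapse = " "), con)      # class names
  writeLines(paste(labels, collapse = " "), con)               # sample labels
}

# Example use with a toy 2-gene, 3-sample dataset:
# expr <- matrix(c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5), nrow = 2,
#                dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))
# write_gct(expr, "my_dataset.gct")
# write_cls(c("ALL", "ALL", "AML"), "my_dataset.cls")
```

The order of the labels in the CLS file has to match the order of the sample columns in the GCT file.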

If you have questions or problems running or using the program, please send them to gsea@broad.mit.edu. Also let us know if you find GSEA a useful tool in your work.