R-GSEA Readme

From GeneSetEnrichmentAnalysisWiki
Revision as of 08:05, 12 July 2006 by Hkuehn

The GSEA program is provided as a standalone R program. This file describes how to run the R program to reproduce the results shown in the paper. These examples illustrate the use of the method and can easily be modified to work on other datasets on the user's computer.

The zip file (GSEA.Examples.zip) contains all the data, R scripts and results of the examples described in the paper.

To run the R program, first expand GSEA.Examples.zip into a directory of your choice on your machine (let's call it “my-directory”). This should create the following subdirectory structure:

<my-directory>/GSEA/Examples/* : Example scripts for all the examples in the paper, plus a dataset subdirectory for each example. It also has the complete results for the ALLAML_S1 example.
<my-directory>/GSEA/AnnotationFiles/* : Gene annotation files.
<my-directory>/GSEA/GeneSetDatabases/* : Gene set databases and annotation files. The gene annotation files are from Affymetrix.
<my-directory>/GSEA/method/GSEA.R : GSEA R program.
<my-directory>/GSEA/method/Documentation : Documentation for the individual GSEA R functions (in the style used for an R package).
<my-directory>/GSEA/README.txt : The README file (containing this information).



Then, to run a specific example, do the following (these steps are for the Leukemia S1 example):

Use a text editor to open <my-directory>/GSEA/Examples/Run.ALLAML_S1.R.
In that file, change the file pathnames (there are 7 of them) to point to the right directories on your computer. For example, if you want to run using the same directory structure as the examples, replace all instances of "d:/CGP2004" with "<my-directory>".

That is all you need to make the script ready to run on your machine. You can also change the "nperm" parameter from 1000 to 20 for a quick run that tests that everything is OK before making the longer actual run of the example. After changing the pathnames, you can either cut and paste the script's contents into an R console or "source" the file from your R command line.

The R script “sources” the GSEA program from <my-directory>/GSEA/method/GSEA.R and then calls the GSEA program with all the files and parameter settings. The GSEA program will run and produce (overwrite) the files in the directory <my-directory>/GSEA/Examples/ALLAML_S1. You can move the result files under ALLAML_S1 to another place, or delete them, before running the script (but leave the input datasets allaml.dataset.gct and allaml.phenotype.cls in place). If you didn't save the result files and want to check whether your run produced the same results, you can always go back to the original files in the zip file.

When you run the scripts with the same parameters (remember to set nperm = 1000 if you changed it for a quick run), you should obtain results identical to the original result files (*.report.txt, global.plots.jpeg, etc.).
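The path edit described above can also be scripted. The following is a minimal sketch, not part of the GSEA distribution: the helper name retarget.run.script is hypothetical, the demonstration runs on a stand-in script rather than the real Run.ALLAML_S1.R, and the "nperm" pattern assumes the run script sets the parameter in the form nperm = 1000.

```r
# Hypothetical helper (not part of GSEA.R): rewrite the root directory used in
# an example run script and, optionally, the number of permutations.
retarget.run.script <- function(script.in, script.out, old.root, new.root,
                                nperm = NULL) {
  lines <- readLines(script.in)
  # Literal (non-regex) replacement of the old root directory.
  lines <- gsub(old.root, new.root, lines, fixed = TRUE)
  if (!is.null(nperm)) {
    # Assumption: the script contains an assignment written like "nperm = 1000".
    lines <- gsub("nperm[[:space:]]*=[[:space:]]*[0-9]+",
                  paste("nperm =", nperm), lines)
  }
  writeLines(lines, script.out)
  invisible(lines)
}

# Self-contained demonstration on a stand-in script in a temporary file.
demo.in  <- tempfile(fileext = ".R")
demo.out <- tempfile(fileext = ".R")
writeLines(c('source("d:/CGP2004/GSEA/method/GSEA.R")',
             'nperm = 1000,'), demo.in)
demo.lines <- retarget.run.script(demo.in, demo.out,
                                  old.root = "d:/CGP2004",
                                  new.root = "/home/me/my-directory",
                                  nperm = 20)
print(demo.lines)
```

With the real files, script.in would be <my-directory>/GSEA/Examples/Run.ALLAML_S1.R, and the rewritten copy could then be run with source(). Writing to a separate output file keeps the original script untouched.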

In the same way you can reproduce the results of the other examples. Once you manage to do this, you can try the GSEA method on your own data by just replacing the input datasets and, potentially, the gene set databases. The gene set databases have been updated frequently, so we have saved the particular version that was used in the original run of each example. If you want generic versions of the gene set databases, use the ones with the explicit chip type in the name, such as s1.hgu95av2.gmt.

Description of GSEA output

The results of the GSEA are stored in the "output.directory" specified by the user as part of the input parameters to the GSEA R program. The results files are:

  • Two tab-separated global results text files (one for each phenotype). These files are labeled according to the doc string prefix and the phenotype name from the CLS (class) file: <doc.string>.results.report.<phenotype>.txt
  • One set of global plots. They include a) gene list correlation profile, b) global observed and null densities, c) heat map for the entire sorted dataset, and d) p-values vs. NES plot. These plots are in a single JPEG file named <doc.string>.global.plots.<phenotype>.jpg. When the program is run interactively, these plots appear in a window in the R GUI.
  • A variable number of tab-separated gene set results files, according to how many sets pass any of the significance thresholds ("nom.p.val.threshold", "fwer.p.val.threshold", and "fdr.q.val.threshold") and how many are specified in the "topgs" parameter. These files are named: <doc.string>.<gene set name>.report.txt.
  • A variable number of gene set plots (one for each gene set report file). These plots include a) gene set running enrichment "mountain" plot, b) gene set null distribution and c) heat map for genes in the gene set. These plots are stored in a single JPEG file named <doc.string>.<gene set name>.jpg.
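As a rough illustration of the naming patterns listed above, the sketch below builds the expected report and gene set file names. The helper name gsea.output.names is hypothetical, and the doc string "Leukemia", the phenotypes "ALL" and "AML", and the gene set name "SET_A" are made-up values used only for the demonstration.

```r
# Hypothetical helper: build expected output file names from the naming
# patterns described in this README. It does not touch the file system.
gsea.output.names <- function(doc.string, phenotypes, gene.sets) {
  list(
    # <doc.string>.results.report.<phenotype>.txt (one per phenotype)
    global.reports = sprintf("%s.results.report.%s.txt", doc.string, phenotypes),
    # <doc.string>.<gene set name>.report.txt (one per reported gene set)
    gene.set.reports = sprintf("%s.%s.report.txt", doc.string, gene.sets),
    # <doc.string>.<gene set name>.jpg (one per gene set report)
    gene.set.plots = sprintf("%s.%s.jpg", doc.string, gene.sets)
  )
}

# Illustrative values only; real names depend on your own run parameters.
out.names <- gsea.output.names("Leukemia", c("ALL", "AML"), "SET_A")
print(out.names$global.reports)
print(out.names$gene.set.reports)
```

Combining these names with file.path(output.directory, ...) and file.exists() gives a quick check that a run produced the files you expect.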


The format (columns) for the global result files is as follows.

GS : Gene set name.
SIZE : Number of genes in the set.
SOURCE : Set definition or source.
ES : Enrichment score.
NES : Normalized enrichment score (multiplicative rescaling).
NOM p-val : Nominal p-value (from the null distribution of the gene set).
FDR q-val : False discovery rate q-values.
FWER p-val : Family-wise error rate p-values.
Tag % : Percent of gene set before running enrichment peak.
Gene %: Percent of gene list before running enrichment peak.
Signal : Enrichment signal strength.
FDR(median): FDR q-values from the median of the null distributions.
glob.p.val : P-value using a global statistic (number of sets above the given set's NES).


The rows are sorted by NES value (from the maximum positive or negative NES to the minimum).
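A global report can be loaded back into R for further filtering. The sketch below is illustrative only: the helper names are hypothetical, the 0.25 cutoff is an arbitrary example value, the stand-in file contains made-up numbers and only a subset of the columns, and it assumes the file's header row uses the column labels exactly as listed above.

```r
# Hypothetical helper: read one tab-separated global report.
read.global.report <- function(path) {
  # check.names = FALSE keeps headers such as "FDR q-val" unchanged.
  read.delim(path, check.names = FALSE, stringsAsFactors = FALSE)
}

# Hypothetical helper: keep gene sets whose FDR q-value is at or below a cutoff.
significant.sets <- function(report, fdr.cutoff = 0.25) {
  keep <- which(report[["FDR q-val"]] <= fdr.cutoff)  # which() drops NA rows
  report[keep, c("GS", "SIZE", "NES", "FDR q-val")]
}

# Stand-in report with made-up values, written to a temporary file.
demo.path <- tempfile(fileext = ".txt")
writeLines(c("GS\tSIZE\tES\tNES\tNOM p-val\tFDR q-val",
             "SET_A\t40\t0.61\t2.10\t0.001\t0.02",
             "SET_B\t25\t0.35\t1.20\t0.210\t0.48"), demo.path)
demo.report <- read.global.report(demo.path)
demo.hits <- significant.sets(demo.report)
print(demo.hits)
```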

The format (columns) for the individual gene set result files is as follows.

# : Gene number in the (sorted) gene set.
PROBE_ID : The gene name or accession number in the dataset.
SYMBOL : The gene symbol from the gene annotation file.
DESC : The gene description (title) from the gene annotation file.
LIST LOC : The location of the gene in the sorted gene list.
S2N : The signal to noise ratio (correlation) of the gene in the gene list.
RES : The value of the running enrichment score at the gene location.
CORE_ENRICHMENT : Is this gene in the "core enrichment" section of the list? A Yes/No variable specifying whether the gene location is before (positive ES) or after (negative ES) the running enrichment peak.


The rows are sorted by the gene location in the gene list.
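One common use of an individual gene set report is to pull out the core enrichment genes. The sketch below assumes the header row matches the column labels listed above and that the CORE_ENRICHMENT column holds yes/no text (compared case-insensitively here). The helper name and the gene symbols and values in the stand-in file are made up.

```r
# Hypothetical helper: return the symbols of genes flagged as core enrichment.
core.enrichment.genes <- function(report) {
  # Compare case-insensitively so "Yes", "YES" and "yes" are all accepted.
  is.core <- toupper(as.character(report[["CORE_ENRICHMENT"]])) == "YES"
  report[which(is.core), "SYMBOL"]
}

# Stand-in gene set report with made-up values, written to a temporary file.
gs.path <- tempfile(fileext = ".txt")
writeLines(c("#\tPROBE_ID\tSYMBOL\tDESC\tLIST LOC\tS2N\tRES\tCORE_ENRICHMENT",
             "1\tp1_at\tGENE1\tdemo gene 1\t5\t1.20\t0.10\tYES",
             "2\tp2_at\tGENE2\tdemo gene 2\t17\t0.95\t0.22\tYES",
             "3\tp3_at\tGENE3\tdemo gene 3\t840\t0.10\t0.05\tNO"), gs.path)
# read.delim does not treat "#" as a comment character, so the "#" header
# column is read as ordinary text; check.names = FALSE keeps "LIST LOC" intact.
gs.report <- read.delim(gs.path, check.names = FALSE, stringsAsFactors = FALSE)
core.genes <- core.enrichment.genes(gs.report)
print(core.genes)
```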

The function call to GSEA returns a two-element list containing the two global result reports as data frames ($report1, $report2).
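The returned list can be handled like any other R list of data frames. The sketch below uses a stand-in list with made-up values in place of a real GSEA result, and the helper combine.reports is hypothetical. It stacks the two reports into one data frame and records which report each row came from.

```r
# Stand-in for the value returned by the GSEA call: a list with two data
# frames. The gene set names and NES values are made up for illustration.
gsea.result <- list(
  report1 = data.frame(GS = c("SET_A", "SET_B"), NES = c(2.1, 1.2),
                       stringsAsFactors = FALSE),
  report2 = data.frame(GS = "SET_C", NES = -1.8, stringsAsFactors = FALSE)
)

# Hypothetical helper: stack both reports and tag each row with its origin.
combine.reports <- function(result) {
  tag <- function(df, label) {
    data.frame(report = rep(label, nrow(df)), df,
               check.names = FALSE, stringsAsFactors = FALSE)
  }
  rbind(tag(result$report1, "report1"), tag(result$report2, "report2"))
}

combined <- combine.reports(gsea.result)
print(combined)
```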