Xtools.gsea.Gsea

From GeneSetEnrichmentAnalysisWiki
Revision as of 15:58, 21 March 2006 by Hkuehn

Use the Run GSEA page to run the gene set enrichment analysis. To display this page, click the Run GSEA icon in the GSEA main window.
Place your cursor on a parameter name to see a brief description of the parameter. To run the analysis, set the parameters and click Run. You can also run the analysis from the command line, as described in <a href="#_Running_GSEA_from_the Command Line">Running GSEA from the Command Line</a>.

Required Fields

Required Fields lists parameters that are essential for the analysis. Enter values for these parameters before starting the analysis.

        Expression dataset. Select an <a href="#_Expression_Datasets">expression dataset</a> from the drop-down list. If the dataset is not listed, you have not yet loaded it; see <a href="#_Loading_Data_1">Loading Data</a>.

        Phenotype labels. Click the ellipsis (…) button to display the Select One or More Phenotypes window, which allows you to select one or more <a href="#_Phenotype_Labels_1">phenotypes</a> to analyze. GSEA analyzes each phenotype separately and produces a single report that contains all of the analysis results. For more information about this window, see <a href="#_Select_One_or_More Phenotypes Windo">Select One or More Phenotypes Window</a>.

        Gene sets database. Click the ellipsis (…) button and select one or more <a href="#_Gene_Sets">gene sets</a>. For convenience, the gene set selection window lists different gene set formats in separate tabs:

         GeneSets(grp), which lists ** these are always in mem lists created by GSEA?**

         GeneMatrix (gmx/gmt), which lists gene set files that you have loaded.

         Subsets, which lists each gene set in each gmx/gmt file.

         Text Entry, which allows you to create a gene set by entering the genes for that gene set. The gene set is created in memory and deleted when you exit from GSEA.

The array (DNA chip) format of your gene sets must match that of your expression dataset, as described in <a href="#_Consistent_Feature_Identifiers_Acro">Consistent Feature Identifiers Across Data Files</a>.

        Number of permutations. Specify the number of permutations to perform in assessing the statistical significance of the enrichment score. It is best to start with a small number, such as 10. After the analysis completes successfully, run it again with a full set of permutations. The GSEA team recommends 1000 phenotype permutations or 10000 tag permutations.

        Permutation type. Select the type of permutation to perform in assessing the statistical significance of the enrichment score:

         Phenotype. Random phenotypes are created by shuffling the phenotype labels on the samples. For each random phenotype, a ranked gene list is calculated (using the same metric as with the actual phenotype) and the enrichment of each gene set is scored. These enrichment scores are used to create a null distribution from which the significance of the enrichment scores of the actual phenotype is calculated. Refer to the <Gene Set Enrichment Analysis> PNAS paper for details. This is the recommended method when there are at least ten samples in each class. **Aravind says 7 samples; Pablo says 10**

         Tag. Random gene sets, size matched to the actual gene set, are created and their enrichment scores are calculated. These enrichment scores are used to create a null distribution from which the significance of the enrichment scores of the actual gene sets is calculated. This method is useful when you have too few samples to do phenotype permutations (that is, when you have fewer than ten samples in any class).

The GSEA team recommends using phenotype permutation whenever possible. The phenotype permutation shuffles the phenotype labels on the samples in the dataset; it does not modify gene sets. Therefore, the correlations between the genes in the dataset and the genes in a gene set are preserved across phenotype permutations. The tag permutation creates random gene sets; therefore, the correlations between the genes in the dataset and the genes in the gene set are not preserved across tag permutations. Preserving the gene-to-gene correlation across permutations provides a more biologically reasonable (more stringent) assessment of significance.
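
The difference between the two permutation types can be illustrated with a minimal Python sketch. It is not GSEA's implementation; the function names `permute_phenotype_labels` and `random_gene_set` are hypothetical, and the sketch only shows what each permutation type randomizes (the sample labels versus the gene set membership).

```python
import random


def permute_phenotype_labels(labels, rng):
    """Phenotype permutation: shuffle the class labels across samples.

    The expression data and the gene sets are left untouched, so
    gene-to-gene correlations are preserved.
    """
    shuffled = list(labels)  # copy so the caller's labels are not modified
    rng.shuffle(shuffled)
    return shuffled


def random_gene_set(all_genes, size, rng):
    """Tag permutation: draw a random gene set of the same size as the real one.

    The sample labels are left untouched; only set membership is randomized.
    """
    return set(rng.sample(sorted(all_genes), size))


if __name__ == "__main__":
    rng = random.Random(149)  # fixed seed so the illustration is repeatable
    labels = ["class_a"] * 3 + ["class_b"] * 3
    genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
    print(permute_phenotype_labels(labels, rng))
    print(random_gene_set(genes, 3, rng))
```

In either mode, the enrichment score would be recomputed once per permutation, and the resulting scores would form the null distribution.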

        Analyze in this feature space. **Call this “Gene/probe identifier format” ** Select one of the following options:

         gene_symbols (default). GSEA uses the CollapseDataset tool to collapse the probe sets in the expression dataset. The probe sets for each gene are collapsed into a single vector for that gene, which is identified by its HUGO gene symbol. Because the genes being analyzed are identified by HUGO gene symbol, the gene sets that you specify must also identify genes by HUGO gene symbol. For more information, see <a href="#_Consistent_Feature_Identifiers_Acro">Consistent Feature Identifiers Across Data Files</a>.

         native. GSEA does not collapse the probe sets in the expression dataset. The features (genes or probes) being analyzed are identified by the probe identifiers native to the dataset. The features (genes or probes) in the expression dataset and the genes in your gene sets must use the same feature identifiers. For more information, see <a href="#_Consistent_Feature_Identifiers_Acro">Consistent Feature Identifiers Across Data Files</a>.

        Chip platform. Select the array (DNA chip) format that matches your expression dataset. If the chip that you need is not listed, download the array annotation file for that chip, as described in <a href="#_Downloading_Array_Annotations">Downloading Array Annotations</a>. Alternatively, you can leave this field blank; however, genes will not be annotated in the analysis report.

Basic Fields

Basic Fields lists additional parameters with standard defaults. Typically, you use the default values for these parameters. (Click Show/Hide to display and hide these parameters.)

        Analysis name. A short descriptive label for the analysis. The name cannot include spaces. This label is used as a prefix when naming the output report generated by the analysis (for example, my_analysis.Gsea.1130510139575.rpt).

        Enrichment statistic. As described in the <Gene Set Enrichment Analysis> PNAS paper, to calculate the enrichment score, GSEA first walks down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not. The enrichment score is the maximum deviation from zero encountered during that walk. Use this parameter to select the running-sum statistic used for the analysis.

The last section of the <Gene Set Enrichment Analysis> PNAS paper shows the mathematical descriptions of the methods used in GSEA. This option controls the value of p used in the enrichment score calculation shown there:

         classic: p=0

         weighted (default): p=1

         weighted_p2: p=2

         weighted_p1.5: p=1.5
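
The role of p in the running sum can be sketched in Python. This is a simplified illustration, not GSEA's Java code, and it assumes the weighting described in the PNAS paper: each gene in the set adds |score|^p (scaled so that all hits sum to 1), and each gene outside the set subtracts an equal share. The function name `enrichment_score` and its arguments are hypothetical.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return the maximum deviation from zero of the running-sum statistic.

    ranked_genes: gene identifiers in ranked order.
    scores: ranking metric values, in the same order as ranked_genes.
    gene_set: collection of gene identifiers to test.
    p: weighting exponent (0 = classic, 1 = weighted, and so on).
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    # Weight of each hit; with p=0 every hit counts equally.
    weights = [abs(s) ** p if hit else 0.0 for s, hit in zip(scores, in_set)]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all genes in the set have a score of zero")

    running = 0.0
    best = 0.0
    for weight, hit in zip(weights, in_set):
        if hit:
            running += weight / total_weight  # step up on a hit
        else:
            running -= 1.0 / n_miss           # step down on a miss
        if abs(running) > abs(best):
            best = running                    # track the largest deviation
    return best


if __name__ == "__main__":
    genes = ["g1", "g2", "g3", "g4"]
    metric = [3.0, 2.0, -1.0, -2.0]
    print(enrichment_score(genes, metric, {"g1", "g4"}, p=0))
    print(enrichment_score(genes, metric, {"g1", "g4"}, p=1))
```

With p=0 the position of a gene in the list matters but its score does not; larger values of p give more weight to genes with extreme scores.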

        Metric for ranking genes. As described in the <Gene Set Enrichment Analysis> PNAS paper, GSEA first orders genes in a ranked list according to their differential expression. Use this parameter to select the metric that GSEA uses to rank the genes.

         Signal2Noise (default). The signal-to-noise ratio uses the following formula to determine differential gene expression with respect to two phenotypes:

(mean in class_a - mean in class_b) / (standard deviation class_a + standard deviation class_b)

The larger the signal-to-noise ratio, the larger the differences of the means (scaled by the standard deviations); that is, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.” (To have GSEA use the median of class_a minus the median of class_b in this calculation, select Options>Use median instead of mean for metrics.)
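
As an illustration, the signal-to-noise ratio and the resulting gene ranking can be sketched in Python. The function names are hypothetical, and the sketch assumes the sample standard deviation with no further adjustment; GSEA itself may adjust the standard deviations before dividing.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """(mean_a - mean_b) / (sd_a + sd_b) for one gene's expression values.

    Each class needs at least two samples, and the two standard
    deviations must not both be zero.
    """
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


def rank_by_signal_to_noise(dataset):
    """Rank genes from most class_a-like to most class_b-like.

    dataset maps a gene name to a (class_a values, class_b values) pair.
    """
    scores = {gene: signal_to_noise(a, b) for gene, (a, b) in dataset.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    example = {
        "up_in_a": ([10, 12, 14], [4, 5, 6]),
        "flat": ([5, 6, 7], [5, 6, 7]),
        "up_in_b": ([1, 2, 3], [7, 9, 11]),
    }
    for gene, score in rank_by_signal_to_noise(example):
        print(gene, round(score, 3))
```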

         tTest. **need explanations**

         Cosine.

         Euclidean.

         Manhattan.

         Pearson.

         None.

         Bhattacharyya.

If you want to use a ranking metric other than those listed here, or you have a ranked list of genes that you want to analyze, you can use the <a href="#_Preferences_Window">GSEAPreranked</a> analysis to have GSEA analyze a list of ranked genes that you provide.

        Gene list sorting mode. Mode (real, absolute) in which scores from the gene list should be considered.

        Gene list ordering mode. Direction (ascending, descending) in which the gene list should be ordered.

        Max size. Gene sets larger than this are excluded from the analysis.

        Min size. Gene sets smaller than this are excluded from the analysis.
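
The effect of the two size limits can be sketched as a simple filter. This is an illustration only: the function name `filter_by_size` and the limits used in the example are hypothetical, and the sketch counts the genes as listed in each gene set, which may differ from how GSEA counts them.

```python
def filter_by_size(gene_sets, min_size, max_size):
    """Keep gene sets whose size is between min_size and max_size, inclusive.

    gene_sets maps a gene set name to a list of gene identifiers.
    Sets smaller than min_size or larger than max_size are dropped.
    """
    return {
        name: genes
        for name, genes in gene_sets.items()
        if min_size <= len(set(genes)) <= max_size
    }


if __name__ == "__main__":
    sets = {
        "too_small": ["a"],
        "kept": ["a", "b", "c"],
        "too_big": list("abcdefgh"),
    }
    print(sorted(filter_by_size(sets, min_size=2, max_size=5)))
```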

        Save results in this folder. Path of the directory in which to place the analysis results. Existing results in this folder are not overwritten. By default, analysis results are saved in the GSEA Results Folder. To view this folder, select Help>GSEA Results Folder.

Advanced Fields

Advanced Fields lists parameters that control details of the GSEA algorithm and its Java implementation. Do not change the default values of these parameters unless you are conversant with the algorithm and its Java implementation. (Click Show/Hide to display and hide these parameters.)

        Collapsing mode for probe sets => 1 gene. Used only when the Analyze in this feature space parameter is set to gene_symbols. Select the value to use for the single probe that will represent all probe sets for the gene: max_probe (default) to use the highest expression value or median_of_probes to use the median value.
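
The two collapsing modes can be sketched in Python. This is not the CollapseDataset tool itself; the function name `collapse_probes` and the data layout are hypothetical, and the sketch assumes that the maximum or median is taken separately for each sample across all probe sets that map to the same gene. Probe sets with no gene symbol are simply skipped here.

```python
from statistics import median


def collapse_probes(expression, probe_to_gene, mode="max_probe"):
    """Collapse probe-level rows into one row per gene.

    expression: probe id -> list of expression values (one per sample).
    probe_to_gene: probe id -> gene symbol; unmapped probes are skipped.
    mode: "max_probe" or "median_of_probes".
    """
    if mode == "max_probe":
        combine = max
    elif mode == "median_of_probes":
        combine = median
    else:
        raise ValueError(f"unknown collapsing mode: {mode}")

    # Group the probe rows by gene symbol.
    rows_by_gene = {}
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            rows_by_gene.setdefault(gene, []).append(values)

    # Combine the rows of each gene sample by sample.
    return {
        gene: [combine(column) for column in zip(*rows)]
        for gene, rows in rows_by_gene.items()
    }


if __name__ == "__main__":
    expression = {"p1": [1, 5], "p2": [3, 2], "p3": [7, 7], "p4": [9, 9]}
    mapping = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
    print(collapse_probes(expression, mapping, "max_probe"))
    print(collapse_probes(expression, mapping, "median_of_probes"))
```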

        Normalization mode. Method used to normalize the enrichment score for each gene set to account for the size of the set. The last section of the <Gene Set Enrichment Analysis> PNAS paper shows the mathematical descriptions of the methods used in GSEA. This option controls the normalization method used to adjust for variation in gene set size during multiple hypothesis testing:

         meandiv (default): GSEA divides by the mean (over random permutations for the same set) to normalize the enrichment scores.

         varmean: GSEA divides by the variance (over random permutations for the same set) to normalize the enrichment scores. ** this is now VarMeanPosNegSeparate; is the definition the same?**

         none: GSEA does not normalize the enrichment scores.
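
A minimal sketch of the meandiv idea follows. The function name `normalize_meandiv` is hypothetical, and the sketch assumes that positive and negative scores are rescaled separately, that is, an observed score is divided by the mean of the permutation scores that share its sign. This may not match GSEA's implementation in every detail.

```python
from statistics import mean


def normalize_meandiv(es, null_scores):
    """Divide an enrichment score by the mean of the same-sign null scores.

    es: observed enrichment score for one gene set.
    null_scores: enrichment scores of the same gene set over the permutations.
    """
    same_sign = [s for s in null_scores if (s >= 0) == (es >= 0)]
    if not same_sign:
        raise ValueError("no permutation scores share the sign of the observed score")
    scale = abs(mean(same_sign))
    if scale == 0:
        raise ValueError("mean of the permutation scores is zero")
    return es / scale


if __name__ == "__main__":
    null = [0.2, 0.4, -0.3, -0.5]
    print(normalize_meandiv(0.6, null))
    print(normalize_meandiv(-0.4, null))
```

Dividing by a per-set mean puts the scores of large and small gene sets on a comparable scale, which is what makes them comparable across gene sets.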

        Randomization mode. Method used to generate a random number for phenotype permutations. Not used for tag permutations.

         no_balance (default). Randomizes the phenotypes without regard to phenotypes.

         equalize_and_balance. **need definition**

        Omit features with no symbol match. Set to True (default) to have the new dataset exclude features (genes) that have no gene symbols. Set to False to have the new dataset contain all features (genes) that were in the original dataset.

        Make detailed gene set report. Set to True (default) to create a detailed gene set report for each enriched gene set.

        Median for class metrics. Set to True to use the median of each class, instead of the mean, for the class separation metrics. The Use median instead of mean for metrics item in the <a href="#_Options">Options menu</a> controls the default setting for this parameter. (If you change the setting in the <a href="#_Options">Options menu</a>, the new default takes effect the next time you open GSEA.)

        Number of markers. **needs definition**

        Plot graphs for the top sets of each phenotype. Generates summary plots and detailed analysis results for the top x gene sets in each phenotype, where x is 20 by default. The top gene sets are those with the largest normalized enrichment scores.

        Seed for permutation. Seed (timestamp, 149) used to generate a random number for phenotype and tag permutations. The specific seed value (149) generates consistent results, which is useful when testing software.

        Save random ranked lists. Set to True (default=false) to save the random ranked lists of genes created by phenotype permutations. When you save random ranked lists, for each permutation, GSEA saves the rank metric score for each gene (the score used to position the gene in the ranked list). Saving random ranked lists is memory intensive; therefore, this parameter is set to false by default. **Aravind: is the score in the saved file really the rank metric score?**

        Make a zipped file with all reports. Set to True (default=false) to create a zip file of the analysis results. The zip file is saved to the output results folder with all of the other files generated by the analysis.

Buttons

Buttons at the bottom of the page:

<a name="_Gene_Set_Utilities_Page"></a><a name="_XCollapseProbes"></a><a name="_CollapseDataset"></a><a name="_Past_Analyses_Page"></a><a name="_Analysis_History_Page"></a>        Help. Displays this page.

        Reset. Restores the default values for all parameters.

        Last. Sets all parameters to the values used the last time you ran this analysis.

        Command. Displays the command line used to run the analysis, as described in <a href="#_Running_GSEA_from_the Command Line">Running GSEA from the Command Line</a>.

        Low/Normal (CPU usage). Determines the amount of CPU dedicated to this analysis. To use your computer for other tasks while running GSEA in the background, choose Low. To complete your analysis more quickly, choose Normal.

        Run. Starts the analysis.