Difference between revisions of "Data formats"
Line 1: | Line 1: | ||
+ | <h1 class="news"> </h1> | ||
+ | <h1 class="news"> </h1> | ||
+ | <h1>Data Formats Supported by GSEA</h1> | ||
+ | <p class="MsoNormal"><img width="640" height="336" v:shapes="_x0000_i1025" alt="GSEA input file formats snapshot" src="file:///C:%5CDOCUME~1%5Chkuehn%5CLOCALS~1%5CTemp%5Cmsohtml1%5C01%5Cclip_image002.jpg" /></p> | ||
+ | <p class="MsoNormal">For examples of each format, </p> | ||
+ | <p align="right" style="text-align: right;"><a href="../../../resources/datasets_index.html"><span style="text-decoration: none;"><img width="139" height="29" border="0" v:shapes="_x0000_i1026" src="file:///C:%5CDOCUME~1%5Chkuehn%5CLOCALS~1%5CTemp%5Cmsohtml1%5C01%5Cclip_image004.jpg" alt="" /><img width="16" height="16" border="0" v:shapes="_x0000_i1027" src="file:///C:%5CDOCUME~1%5Chkuehn%5CLOCALS~1%5CTemp%5Cmsohtml1%5C01%5Cclip_image005.gif" alt="" /></span></a></p> | ||
+ | <p class="MsoNormal"><strong style=""><span style="font-size: 14pt;">Expression data formats<o:p></o:p></span></strong></p> | ||
+ | <p class="MsoNormal"><a href="#_GCT:_Gene_Cluster_Text file format ">GCT: Gene Cluster Text file format (*.gct)</a></p> | ||
+ | <p class="MsoNormal"><a href="../../../doc/data_formats.html#res">RES: ExpRESsion (with P and A calls) file format (*.res)</a> </p> | ||
+ | <p class="MsoNormal"><a href="../../../doc/data_formats.html#pcl">PCL: Stanford cDNA file format (*.pcl)</a> </p> | ||
+ | <p class="MsoNormal"><strong style=""><span style="font-size: 14pt;">Phenotype data formats<o:p></o:p></span></strong></p> | ||
+ | <p class="MsoNormal"><a href="../../../doc/data_formats.html#cls">CLS: Categorical (e.g tumor vs normal) class file format (*.cls)</a> </p> | ||
+ | <p class="MsoNormal"><a href="../../../doc/data_formats.html#cls2">CLS: Continuous (e.g time-series or gene profile) file format (*.cls)</a> </p> | ||
+ | <p class="MsoNormal"><strong style=""><span style="font-size: 14pt;">Gene set database formats<o:p></o:p></span></strong></p> | ||
+ | <p class="MsoNormal"><a href="../../../doc/data_formats.html#gmx">GMX: Gene MatriX file format (*.gmx)</a></p> | ||
+ | <p class="MsoNormal"><a href="../../../doc/data_formats.html#gmt">GMT: Gene Matrix Transposed file format (*.gmt)</a></p> | ||
+ | <p class="MsoNormal"><a href="../../../doc/data_formats.html#grp">GRP: Gene set file format (*.grp)</a></p> | ||
+ | <p class="MsoNormal"><a href="../../../doc/data_formats.html#mdb">MDB: Molecular signature database file format (*.mdb)</a></p> | ||
+ | <p class="MsoNormal">**fix Description: na for link to MSigDB; URL (http…) for link to own page**</p> | ||
+ | <p class="MsoNormal"><strong style=""><span style="font-size: 14pt;">Microarray annotation formats<o:p></o:p></span></strong></p> | ||
+ | <p class="MsoNormal"><a href="../../../doc/data_formats.html#chip">CHIP: Chip file format (*.chip)</a> </p> | ||
+ | <p class="MsoNormal"><a href="../../../doc/data_formats.html#map">MAP: Chip mapping file format (*.map)</a> </p> | ||
+ | <p class="MsoNormal">**remove MAP, add CSV (same as chip, but with commas); graphics need fixing**</p> | ||
+ | <p class="MsoNormal"><strong style=""><span style="font-size: 14pt;">Ranked gene lists<o:p></o:p></span></strong></p> | ||
+ | <p class="MsoNormal"><a href="../../../doc/data_formats.html#rnk">RNK: Ranked list file format (*.rnk)</a></p> | ||
+ | <p class="small"><strong style="">Note</strong>: The GCT & RES expression formats supported by GSEA are identical to those supported by GenePattern. Some description is duplicated here - the GenePattern website has more documentation on file formats. </p> | ||
+ | <h1><a name="_GCT:_Gene_Cluster_Text file format "></a>GCT: Gene Cluster Text file format (*.gct)</h1> | ||
+ | <p class="MsoNormal">The GCT format is a tab delimited file format that is organized as follows:</p> | ||
+ | <h1 class="news"> </h1> | ||
<h1 class="news">Data formats supported by GSEA</h1> | <h1 class="news">Data formats supported by GSEA</h1> | ||
− | <img | + | <img src="../../../images/input_file_formats.png" alt="GSEA input file formats snapshot" /><br />You can download example files, from [../resources/datasets_index.html here] or [[../resources/datasets_index.html here]] .<br /> |
<h2>Expression data formats</h2> | <h2>Expression data formats</h2> | ||
<ol> | <ol> | ||
Line 29: | Line 91: | ||
</ol> | </ol> | ||
<p class="small"> Note: The GCT & RES expression formats supported by GSEA are identical to those supported by GenePattern. Some description is duplicated here - the GenePattern website has more documentation on file formats. </p> | <p class="small"> Note: The GCT & RES expression formats supported by GSEA are identical to those supported by GenePattern. Some description is duplicated here - the GenePattern website has more documentation on file formats. </p> | ||
− | <hr /> <br /> <br /> <a name="gct"><strong>GCT File Format</strong></a><br /> <br /> The GCT format is a tab delimited file format that is organized as follows<br /> <img | + | <hr /> <br /> <br /> <a name="gct"><strong>GCT File Format</strong></a><br /> <br /> The GCT format is a tab delimited file format that is organized as follows<br /> <img src="../../../images/gct_format_snapshot.png" alt="GCT format snapshot" /> <br /> |
<ol> | <ol> | ||
<li>The first line contains the version string and is always the same for this file format. Therefore, the first line must be as follows: | <li>The first line contains the version string and is always the same for this file format. Therefore, the first line must be as follows: | ||
Line 55: | Line 117: | ||
</li> | </li> | ||
</ol> | </ol> | ||
− | The main difference between RES and GCT file formats is the RES file format contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software.<br /> <hr /> <a name="res"><strong>RES File Format</strong></a><br /> This is a tab delimited file format that is organized as follows:<br /> <img | + | The main difference between RES and GCT file formats is the RES file format contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software.<br /> <hr /> <a name="res"><strong>RES File Format</strong></a><br /> This is a tab delimited file format that is organized as follows:<br /> <img src="../../../images/res_format_snapshot.png" alt="RES format snapshot" /> |
<ol> | <ol> | ||
<li>The first line contains a list of labels identifying the samples associated with each of the columns in the remainder of the file. Two tabs (\t\t) separate the sample identifier labels because each sample contains two data values (an expression value and a present/marginal/absent call). | <li>The first line contains a list of labels identifying the samples associated with each of the columns in the remainder of the file. Two tabs (\t\t) separate the sample identifier labels because each sample contains two data values (an expression value and a present/marginal/absent call). | ||
Line 82: | Line 144: | ||
</li> | </li> | ||
</ol> | </ol> | ||
− | <hr /> <a name="pcl"><strong>PCL File Format: Expression datasets</strong></a><br /> Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. This is a tab delimited file format that is organized as follows:<br /> <center><img | + | <hr /> <a name="pcl"><strong>PCL File Format: Expression datasets</strong></a><br /> Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. This is a tab delimited file format that is organized as follows:<br /> <center><img src="../../../images/pcl_format_snapshot.png" alt="pcl format snapshot" /></center> <hr /> <a name="cls"><strong>CLS File Format: Categorical</strong></a><br /> <br /> The CLS files are text files created to load class information into GSEA. These files use spaces or tabs to separate the fields.<br /> <center><img src="../../../images/cls_format_snapshot.png" alt="cls format snapshot" /></center> |
<ol> | <ol> | ||
<li>The first line of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file. | <li>The first line of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file. | ||
Line 102: | Line 164: | ||
</ul> | </ul> | ||
</ol> | </ol> | ||
− | <hr /> <a name="cls2"><strong>CLS File Format: Continous</strong></a><br /> <br /> CLS files can also be used to analyze continuous profiles such as those from a time series experiment or to find gene sets correlations with a gene of interest (gene neighbors) <br /> <center><img | + | <hr /> <a name="cls2"><strong>CLS File Format: Continous</strong></a><br /> <br /> CLS files can also be used to analyze continuous profiles such as those from a time series experiment or to find gene sets correlations with a gene of interest (gene neighbors) <br /> <center><img src="../../../images/cls_numeric_format_snapshot.png" alt="cls numeric snapshot" /></center> <center><img src="../../../images/cls_time_series_format_snapshot.png" alt="cls numeric snapshot" /></center> <hr /> <a name="gmx"><strong>Gene set database: GMX File Format</strong></a><br /> <br /> The GMX files contain gene sets in a simple tab-delimited text format.<br /> <center><img src="../../../images/gmx_format_snapshot.png" alt="gmx format snapshot" /></center> <hr /> <a name="gmt"><strong>Gene set database: GMT File Format</strong></a><br /> <br /> The GMT files contain gene sets in a simple tab-delimited text format.<br /> <center><img src="../../../images/gmt_format_snapshot.png" alt="gmt format snapshot" /></center> <hr /> <a name="grp"><strong>GRP File Format</strong></a><br /> <br /> The GRP files contain a SINGLE gene set in a simple newline-delimited text format.<br /> <center><img src="../../../images/grp_format_snapshot.png" alt="grp format snapshot" /></center> <hr /> <a name="chip"><strong>CHIP File Format</strong></a><br /> <br /> The CHIP file contains annotation about a microarray. It should list the features (i.e probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results. 
<br /> <center><img src="../../../images/rnk_format_snapshot.png" alt="grp format snapshot" /></center> <hr /> <a name="map"><strong>MAP File Format</strong></a><br /> <br /> The MAP file contains annotations that map probe sets between microarrays. This file is not used directly in the GSEA algorithm, but is used to generate gene sets (via chip2chip). <br /> <center><img src="../../../images/rnk_format_snapshot.png" alt="grp format snapshot" /></center> <hr /> <a name="rnk"><strong>RNK File Format</strong></a><br /> <br /> The RNK file contains a single, rank ordered gene list (<em>not</em> gene set) in a simple newline-delimited text format. It is used when you have a pre-ordered ranked list that you want to analyze with GSEA. For instance, you might have used you_favorite_tTest_like_statistic to produce a ranked ordered gene list from your dataset which you now want to test for enrichment (note that only gene tag permutations are possible with rnk datasets). <br /> <center><img src="../../../images/rnk_format_snapshot.png" alt="rnk format snapshot" /></center> <hr /> <a name="mdb"><strong>MDB File Format</strong></a><br /> <br /> The MDB files contain an entire gene set database. Unlike the gmt/gmx files, the MDB files are designed to contain rich annotation about a gene set. They are xml formatted. Consult the <a href="../../../doc/msigdb.dtd.txt">MSigDB Document type Definition </a> for details about the format<br /> <br /> <center><img src="../../../images/msigdb_dtd_snapshot.png" alt="msigdb DTD format snapshot" /></center> <br /><strong>Example of an MSigDB xml formatted file</strong><br /> <center><img src="../../../images/msigdb_xml_snapshot.png" alt="msigdb xml format snapshot" /></center> |
Revision as of 10:09, 28 March 2006
Data Formats Supported by GSEA
For examples of each format, see the <a href="../../../resources/datasets_index.html">example datasets</a>.
Expression data formats
<a href="#_GCT:_Gene_Cluster_Text file format ">GCT: Gene Cluster Text file format (*.gct)</a>
<a href="../../../doc/data_formats.html#res">RES: ExpRESsion (with P and A calls) file format (*.res)</a>
<a href="../../../doc/data_formats.html#pcl">PCL: Stanford cDNA file format (*.pcl)</a>
Phenotype data formats
<a href="../../../doc/data_formats.html#cls">CLS: Categorical (e.g., tumor vs. normal) class file format (*.cls)</a>
<a href="../../../doc/data_formats.html#cls2">CLS: Continuous (e.g., time-series or gene profile) file format (*.cls)</a>
Gene set database formats
<a href="../../../doc/data_formats.html#gmx">GMX: Gene MatriX file format (*.gmx)</a>
<a href="../../../doc/data_formats.html#gmt">GMT: Gene Matrix Transposed file format (*.gmt)</a>
<a href="../../../doc/data_formats.html#grp">GRP: Gene set file format (*.grp)</a>
<a href="../../../doc/data_formats.html#mdb">MDB: Molecular signature database file format (*.mdb)</a>
**fix Description: na for link to MSigDB; URL (http…) for link to own page**
Microarray annotation formats
<a href="../../../doc/data_formats.html#chip">CHIP: Chip file format (*.chip)</a>
<a href="../../../doc/data_formats.html#map">MAP: Chip mapping file format (*.map)</a>
**remove MAP, add CSV (same as chip, but with commas); graphics need fixing**
Ranked gene lists
<a href="../../../doc/data_formats.html#rnk">RNK: Ranked list file format (*.rnk)</a>
Note: The GCT and RES expression formats supported by GSEA are identical to those supported by GenePattern. Some of the description is duplicated here; the GenePattern website has more documentation on file formats.
<a name="_GCT:_Gene_Cluster_Text file format "></a>GCT: Gene Cluster Text file format (*.gct)
The GCT format is a tab delimited file format that is organized as follows:
Data formats supported by GSEA
<img src="../../../images/input_file_formats.png" alt="GSEA input file formats snapshot" />
You can download example files <a href="../../../resources/datasets_index.html">here</a>.
Expression data formats
- <a href="../../../doc/data_formats.html#gct">GCT: Gene Cluster Text file format (*.gct)</a>
- <a href="../../../doc/data_formats.html#res">RES: ExpRESsion (with P and A calls) file format (*.res)</a>
- <a href="../../../doc/data_formats.html#pcl">PCL: Stanford cDNA file format (*.pcl)</a>
Phenotype data formats
- <a href="../../../doc/data_formats.html#cls">CLS: Categorical (e.g., tumor vs. normal) class file format (*.cls)</a>
- <a href="../../../doc/data_formats.html#cls2">CLS: Continuous (e.g., time-series or gene profile) file format (*.cls)</a>
Gene set database formats
- <a href="../../../doc/data_formats.html#gmx">GMX: Gene MatriX file format (*.gmx)</a>
- <a href="../../../doc/data_formats.html#gmt">GMT: Gene Matrix Transposed file format (*.gmt)</a>
- <a href="../../../doc/data_formats.html#grp">GRP: Gene set file format (*.grp)</a>
- <a href="../../../doc/data_formats.html#mdb">MDB: Molecular signature database file format (*.mdb)</a>
Microarray annotation formats
- <a href="../../../doc/data_formats.html#chip">CHIP: Chip file format (*.chip)</a>
- <a href="../../../doc/data_formats.html#map">MAP: Chip mapping file format (*.map)</a>
Ranked gene lists
- <a href="../../../doc/data_formats.html#rnk">RNK: Ranked list file format (*.rnk)</a>
<a name="gct">GCT File Format</a>
The GCT format is a tab-delimited file format that is organized as follows:
<img src="../../../images/gct_format_snapshot.png" alt="GCT format snapshot" />
- The first line contains the version string and is always the same for this file format. Therefore, the first line must be as follows:
  - #1.2
- The second line contains two numbers giving the size of the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.
  - Line format: (# of data rows) (tab) (# of data columns)
  - For example: 7129 58
- The third line contains a list of identifiers for the samples associated with each of the columns in the remainder of the file.
  - Line format: Name (tab) Description (tab) (sample 1 name) (tab) (sample 2 name) (tab) ... (sample N name)
  - For example: Name Description DLBC1_1 DLBC2_1 ... DLBC58_0
- The remainder of the data file contains data for each of the genes. There is one line for each gene and one column for each of the samples. The first two fields in the line contain the name and description of the gene (names and descriptions can contain spaces since fields are separated by tabs). The number of lines should agree with the number of data rows specified on line 2.
  - Line format: (gene name) (tab) (gene description) (tab) (col 1 data) (tab) (col 2 data) (tab) ... (col N data)
  - For example: AFFX-BioB-5_at AFFX-BioB-5_at (endogenous control) -104 -152 -158 ... -44
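The GCT layout described above can be read with a few lines of Python. This is a minimal sketch, not GSEA's own loader: the function name `parse_gct` and the tiny in-memory example are hypothetical, and the sketch assumes every data value is numeric.

```python
import io


def parse_gct(handle):
    """Read a GCT stream and return (sample_names, rows).

    Each row is a (gene_name, gene_description, values) tuple.
    Hypothetical helper for illustration; assumes numeric data values.
    """
    # Line 1: the fixed version string.
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unexpected version line: {version!r}")

    # Line 2: number of data rows, then number of data columns.
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])

    # Line 3: Name, Description, then one identifier per sample.
    header = handle.readline().rstrip("\r\n").split("\t")
    samples = header[2:]
    if len(samples) != n_cols:
        raise ValueError(f"expected {n_cols} samples, found {len(samples)}")

    # Remaining lines: one gene per line, split on tabs only,
    # because names and descriptions may contain spaces.
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        values = [float(v) for v in fields[2:]]
        if len(values) != n_cols:
            raise ValueError(f"row {fields[0]!r} has {len(values)} values")
        rows.append((fields[0], fields[1], values))

    if len(rows) != n_rows:
        raise ValueError(f"expected {n_rows} rows, found {len(rows)}")
    return samples, rows


example_gct = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "GENE_A\tfirst gene\t1.0\t2.5\t-3\n"
    "GENE_B\tsecond gene\t0\t4\t5.5\n"
)
gct_samples, gct_rows = parse_gct(io.StringIO(example_gct))
```

Checking the counts on line 2 against the actual table catches truncated or hand-edited files early, which is the main reason the sketch validates them instead of ignoring them.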
The main difference between the RES and GCT file formats is that the RES format also contains, for each gene, the absent (A) versus present (P) calls generated by Affymetrix's GeneChip software.
<a name="res">RES File Format</a>
This is a tab delimited file format that is organized as follows:
<img src="../../../images/res_format_snapshot.png" alt="RES format snapshot" />
- The first line contains a list of labels identifying the samples associated with each of the columns in the remainder of the file. Two tabs (\t\t) separate the sample identifier labels because each sample contains two data values (an expression value and a present/marginal/absent call).
  - Line format: Description (tab) Accession (tab) (sample 1 name) (tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)
  - For example: Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0
- The second line contains a list of sample descriptions. Currently, GSEA ignores these descriptions.
  - Line format: (tab) (sample 1 description) (tab) (tab) (sample 2 description) (tab) (tab) ... (sample N description)
  - For example, our RES file creation tool places the sample data file name and scale factors in this row: MG2000062219AA MG2000062256AA/scale factor=1.2172 ... MG2000062211AA/scale factor=1.1214
- The third line contains a number indicating the number of rows in the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.
  - Line format: (# of data rows)
  - For example: 7129
- The rest of the data file contains data for each of the genes. There is one row for each gene and two columns for each of the samples. The first two fields in the row contain the description and name for each of the genes (names and descriptions can contain spaces since fields are separated by tabs). The description field is optional but the tab following it is not. Each sample has two pieces of data associated with it: an expression value and an associated Absent/Marginal/Present (A/M/P) call. The A/M/P calls are generated by microarray scanning software (such as Affymetrix's GeneChip software) and are an indication of the confidence in the measured expression value. Currently, GSEA ignores the Absent/Marginal/Present call.
  - Line format: (gene description) (tab) (gene name) (tab) (sample 1 data) (tab) (sample 1 A/M/P call) (tab) (sample 2 data) (tab) (sample 2 A/M/P call) (tab) ... (sample N data) (tab) (sample N A/M/P call)
  - For example: AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104 A -152 A ... -44 A
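The paired value/call columns of the RES layout described above can be separated with slice steps. The following is a rough sketch under the assumption that every sample contributes exactly one numeric value followed by one call; `parse_res` and the sample data are hypothetical and not part of GSEA.

```python
import io


def parse_res(handle):
    """Read a RES stream and return (sample_names, rows).

    Each row is a (gene_name, gene_description, values, calls) tuple.
    Hypothetical helper for illustration only.
    """
    # Line 1: Description, Accession, then sample names separated by two
    # tabs, so the names sit in every second column from index 2 onward.
    header = handle.readline().rstrip("\r\n").split("\t")
    samples = [name for name in header[2::2] if name]

    # Line 2: sample descriptions, skipped here just as GSEA ignores them.
    handle.readline()

    # Line 3: number of data rows.
    n_rows = int(handle.readline().split()[0])

    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        description, name = fields[0], fields[1]
        # Expression values and A/M/P calls alternate after the two labels.
        values = [float(v) for v in fields[2::2][: len(samples)]]
        calls = fields[3::2][: len(samples)]
        rows.append((name, description, values, calls))

    if len(rows) != n_rows:
        raise ValueError(f"expected {n_rows} rows, found {len(rows)}")
    return samples, rows


example_res = (
    "Description\tAccession\tS1\t\tS2\n"
    "\tS1 scan\t\tS2 scan\n"
    "2\n"
    "control probe\tAFFX-1\t-104\tA\t-152\tA\n"
    "some gene\tG2\t10.5\tP\t20\tM\n"
)
res_samples, res_rows = parse_res(io.StringIO(example_res))
```

Note that the description comes first and the name second in RES rows, the reverse of GCT, so the sketch swaps them when building each tuple.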
<a name="pcl">PCL File Format: Expression datasets</a>
Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. This is a tab-delimited file format that is organized as follows:
<img src="../../../images/pcl_format_snapshot.png" alt="pcl format snapshot" />
<a name="cls">CLS File Format: Categorical</a>
The CLS files are text files created to load class information into GSEA. These files use spaces or tabs to separate the fields.
- The first line of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file.
  - Line format: (number of samples) (space) (number of classes) (space) 1
  - For example: 58 2 1
- The second line in a CLS file contains names for the class numbers. The line should begin with a pound sign (#) followed by a space.
  - Line format: # (space) (class 0 name) (space) (class 1 name)
  - For example: # cured fatal/ref
- The third line contains numeric class labels for each of the samples. The number of class labels should be the same as the number of samples specified in the first line.
  - Line format: (sample 1 class) (space) (sample 2 class) (space) ... (sample N class)
  - For example: 0 0 0 ... 1
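The three-line categorical CLS layout above is small enough to validate completely. This sketch assumes the numeric labels on line 3 index into the class names on line 2 (label 0 is the first name); `parse_cls` and the example text are hypothetical.

```python
def parse_cls(text):
    """Parse a categorical CLS file and return (class_names, labels).

    Hypothetical helper; assumes labels are integers that index into
    the class names listed on the second line.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]

    # Line 1: number of samples, number of classes, then a constant 1.
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])

    # Line 2: a pound sign followed by the class names.
    if not lines[1].startswith("#"):
        raise ValueError("second line must start with '#'")
    class_names = lines[1].lstrip("#").split()
    if len(class_names) != n_classes:
        raise ValueError(f"expected {n_classes} class names")

    # Line 3: one numeric class label per sample (spaces or tabs).
    labels = [int(x) for x in lines[2].split()]
    if len(labels) != n_samples:
        raise ValueError(f"expected {n_samples} labels, found {len(labels)}")
    return class_names, labels


example_cls = "6 2 1\n# cured fatal/ref\n0 0 0 1 1 1\n"
cls_names, cls_labels = parse_cls(example_cls)
# Translate each numeric label into its class name.
cls_per_sample = [cls_names[i] for i in cls_labels]
```

Because `str.split()` with no argument splits on any whitespace, the same code accepts both space-separated and tab-separated CLS files.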
<a name="cls2">CLS File Format: Continuous</a>
CLS files can also be used to analyze continuous profiles, such as those from a time-series experiment, or to find gene sets that correlate with a gene of interest (gene neighbors).
<a name="gmx">Gene set database: GMX File Format</a>
The GMX files contain gene sets in a simple tab-delimited text format.
<a name="gmt">Gene set database: GMT File Format</a>
The GMT files contain gene sets in a simple tab-delimited text format.
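The text above states only that GMT files are tab-delimited. The sketch below assumes the commonly used row layout, in which each line holds one gene set as a name, a description, and then the member genes; that layout, the function name `parse_gmt`, and the sample data are assumptions made for illustration.

```python
def parse_gmt(text):
    """Parse GMT text into {set_name: (description, [genes])}.

    Hypothetical helper; assumes one gene set per line laid out as
    name, description, gene 1, gene 2, ...
    """
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"malformed GMT line: {line!r}")
        name, description = fields[0], fields[1]
        # Drop empty cells left behind by trailing tabs.
        genes = [g for g in fields[2:] if g]
        gene_sets[name] = (description, genes)
    return gene_sets


example_gmt = (
    "SET_ONE\tna\tGENE_A\tGENE_B\tGENE_C\n"
    "SET_TWO\tna\tGENE_B\tGENE_D\n"
)
gmt_sets = parse_gmt(example_gmt)
```

Rows may have different lengths, so the sketch keeps each gene list separate instead of forcing the file into a rectangular table.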
<a name="grp">GRP File Format</a>
The GRP files contain a SINGLE gene set in a simple newline-delimited text format.
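Reading a newline-delimited gene set like the one described above takes very little code. This sketch assumes one gene identifier per line and treats lines beginning with `#` as comments; the comment convention and the name `parse_grp` are assumptions, not something stated above.

```python
def parse_grp(text):
    """Return the genes in a GRP file as a list, in file order.

    Hypothetical helper; assumes one identifier per line and that
    lines starting with '#' are comments.
    """
    genes = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        genes.append(line)
    return genes


example_grp = "# example gene set\nGENE_A\nGENE_B\n\nGENE_C\n"
grp_genes = parse_grp(example_grp)
```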
<a name="chip">CHIP File Format</a>
The CHIP file contains annotations for a microarray. It should list the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results.
<a name="map">MAP File Format</a>
The MAP file contains annotations that map probe sets between microarrays. This file is not used directly in the GSEA algorithm, but is used to generate gene sets (via chip2chip).
<a name="rnk">RNK File Format</a>
The RNK file contains a single, rank-ordered gene list (not gene set) in a simple newline-delimited text format. It is used when you have a pre-ordered ranked list that you want to analyze with GSEA. For instance, you might have used your favorite t-test-like statistic to produce a rank-ordered gene list from your dataset, which you now want to test for enrichment (note that only gene tag permutations are possible with RNK datasets).
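A ranked list of this kind can be loaded as (gene, score) pairs. The sketch below assumes each line holds a gene identifier and a numeric score separated by a tab, and that lines beginning with `#` are comments; the two-column layout, the comment convention, and the name `parse_rnk` are all assumptions for illustration.

```python
def parse_rnk(text):
    """Parse RNK text into a list of (gene, score), highest score first.

    Hypothetical helper; assumes two tab-separated columns per line.
    """
    ranked = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        gene, score = line.split("\t")[:2]
        ranked.append((gene, float(score)))
    # Sort defensively so the result is ordered even if the file was not.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


example_rnk = "GENE_B\t-1.2\nGENE_A\t3.4\nGENE_C\t0.5\n"
rnk_list = parse_rnk(example_rnk)
```

Sorting on load is a design choice of this sketch: it costs little and removes any dependence on the order in which the scoring tool wrote the file.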
<a name="mdb">MDB File Format</a>
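The MDB files contain an entire gene set database. Unlike the GMT/GMX files, the MDB files are designed to contain rich annotation about each gene set. They are XML formatted. Consult the <a href="../../../doc/msigdb.dtd.txt">MSigDB Document Type Definition</a> for details about the format.
<img src="../../../images/msigdb_dtd_snapshot.png" alt="msigdb DTD format snapshot" />
Example of an MSigDB XML-formatted file
<img src="../../../images/msigdb_xml_snapshot.png" alt="msigdb xml format snapshot" />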