Difference between revisions of "Data formats"

From GeneSetEnrichmentAnalysisWiki
m (Clarified RNK file format)
 
(79 intermediate revisions by 6 users not shown)
<p class="MsoNormal"><strong style=""><span style="font-size: 14pt;">Expression data formats</span></strong></p>
+
[http://www.broadinstitute.org/gsea/ GSEA Home] |
<p class="MsoNormal"><a href="#_GCT:_Gene_Cluster_Text file format ">GCT: Gene Cluster Text file format (*.gct)</a></p>
+
[http://www.broadinstitute.org/gsea/downloads.jsp Downloads] |
<p class="MsoNormal"><a href="../../../doc/data_formats.html#res">RES: ExpRESsion (with P and A calls) file format (*.res)</a> </p>
+
[http://www.broadinstitute.org/gsea/msigdb/ Molecular Signatures Database] |
<p class="MsoNormal"><a href="../../../doc/data_formats.html#pcl">PCL: Stanford cDNA file format (*.pcl)</a> </p>
+
[http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page Documentation] |
<p class="MsoNormal"><strong style=""><span style="font-size: 14pt;">Phenotype data formats</span></strong></p>
+
[http://www.broadinstitute.org/gsea/contact.jsp Contact]
<p class="MsoNormal"><a href="../../../doc/data_formats.html#cls">CLS: Categorical (e.g tumor vs normal) class file format (*.cls)</a> </p>
+
<br>
<p class="MsoNormal"><a href="../../../doc/data_formats.html#cls2">CLS: Continuous (e.g time-series or gene profile) file format (*.cls)</a> </p>
+
<br>
<p class="MsoNormal"><strong style=""><span style="font-size: 14pt;">Gene set database formats</span></strong></p>
+
<p>Each GSEA supported file is an ASCII text file with a specific format, as described below. For sample data sets, click [http://www.broadinstitute.org/gsea/datasets.jsp here].</p>
<p class="MsoNormal"><a href="../../../doc/data_formats.html#gmx">GMX: Gene MatriX file format (*.gmx)</a></p>
+
<p>To create and edit GSEA files, use Excel or a text editor. If you are using Excel:</p>
<p class="MsoNormal"><a href="../../../doc/data_formats.html#gmt">GMT: Gene Matrix Transposed file format (*.gmt)</a></p>
+
<ul>
<p class="MsoNormal"><a href="../../../doc/data_formats.html#grp">GRP: Gene set file format (*.grp)</a></p>
+
    <li>Be aware that Excel's auto-formatting can introduce errors in gene names, as described in <span style="font-family: Arial;">[http://www.ncbi.nlm.nih.gov/pubmed/15214961 Zeeberg, et al 2004]. <br>
<p class="MsoNormal"><a href="../../../doc/data_formats.html#mdb">MDB: Molecular signature database file format (*.mdb)</a></p>
+
    </span></li>
<p class="MsoNormal">**fix Description: na for link to MSigDB; URL (http&hellip;) for link to own page**</p>
+
</ul>
<p class="MsoNormal"><strong style=""><span style="font-size: 14pt;">Microarray annotation formats</span></strong></p>
+
<ul>
<p class="MsoNormal"><a href="../../../doc/data_formats.html#chip">CHIP: Chip file format (*.chip)</a> </p>
+
    <li><span style="font-family: Arial;">To create a tab-delimited text file: select File&gt;Save As, enter the file name in quotes to preserve the file extension (for example, </span>&quot;p53.gct&quot;), and select &quot;Text (Tab delimited) (*.txt)&quot; as the file type. Excel displays a message warning you that your file may contain features that are not compatible with this format and asks if you want to keep the workbook in this format. Click Yes to keep this format. Your file has now been saved. Exit from Excel. When Excel asks if you want to save your changes to this file, select No (you have already saved the file).
<p class="MsoNormal"><a href="../../../doc/data_formats.html#map">MAP: Chip mapping file format (*.map)</a> </p>
 
<p class="MsoNormal">**remove MAP, add CSV (same as chip, but with commas); graphics need fixing**</p>
 
<p class="MsoNormal"><strong style=""><span style="font-size: 14pt;">Ranked gene lists</span></strong></p>
 
<p class="MsoNormal"><a href="../../../doc/data_formats.html#rnk">RNK: Ranked list file format (*.rnk)</a></p>
 
<p class="small"><strong style="">Note</strong>: The GCT &amp; RES expression formats supported by GSEA are identical to those supported by GenePattern. Some description is duplicated here - the GenePattern website has more documentation on file formats. </p>
 
<h1><a name="_GCT:_Gene_Cluster_Text file format "></a>GCT: Gene Cluster Text file format (*.gct)</h1>
 
<p class="MsoNormal">The GCT format is a tab delimited file format that is organized as follows:</p>
 
<h1 class="news">&nbsp;</h1>
 
<h1 class="news">Data formats supported by GSEA</h1>
 
<img alt="GSEA input file formats snapshot" src="../../../images/input_file_formats.png" /><br />You can download example files from [../resources/datasets_index.html here].<br />
 
<h2>Expression data formats</h2>
 
<ol>
 
    <li><a href="../../../doc/data_formats.html#gct">GCT: Gene Cluster Text file format (*.gct)</a>        </li>
 
    <li><a href="../../../doc/data_formats.html#res">RES: ExpRESsion (with P and A calls) file format (*.res)</a>        </li>
 
    <li><a href="../../../doc/data_formats.html#pcl">PCL: Stanford cDNA file format (*.pcl)</a>        </li>
 
</ol>
 
<h2>Phenotype data formats</h2>
 
<ol>
 
    <li>            <a href="../../../doc/data_formats.html#cls">CLS: Categorical (e.g tumor vs normal) class file format (*.cls)</a>        </li>
 
    <li>            <a href="../../../doc/data_formats.html#cls2">CLS: Continuous (e.g time-series or gene profile) file format                 (*.cls)</a>        </li>
 
</ol>
 
<h2>Gene set database formats</h2>
 
<ol>
 
    <li><a href="../../../doc/data_formats.html#gmx">GMX: Gene MatriX file            format             (*.gmx)</a></li>
 
    <li><a href="../../../doc/data_formats.html#gmt">GMT: Gene Matrix Transposed file format (*.gmt)</a></li>
 
    <li><a href="../../../doc/data_formats.html#grp">GRP: Gene set file format (*.grp)</a></li>
 
    <li><a href="../../../doc/data_formats.html#mdb">MDB: Molecular signature database file format (*.mdb)</a></li>
 
</ol>
 
<h2>Microarray annotation formats</h2>
 
<ol>
 
    <li>            <a href="../../../doc/data_formats.html#chip">CHIP: Chip file format (*.chip)</a>        </li>
 
    <li><a href="../../../doc/data_formats.html#map">MAP: Chip mapping file format (*.map)</a>        </li>
 
</ol>
 
<h2>Ranked gene lists</h2>
 
<ol>
 
    <li><a href="../../../doc/data_formats.html#rnk">RNK: Ranked list file format (*.rnk)</a></li>
 
</ol>
 
<p class="small">        Note: The GCT &amp; RES expression formats supported by GSEA are identical to those supported by GenePattern.        Some description is duplicated here - the GenePattern website has more documentation on file formats.    </p>
 
<hr />  <br /> <br /> <a name="gct"><strong>GCT File Format</strong></a><br /> <br /> The GCT format is a tab delimited file format that is organized as follows<br /> <img alt="GCT format snapshot" src="../../../images/gct_format_snapshot.png" /> <br />
 
<ol>
 
    <li>The first line contains the version string and is always the same for this file         format. Therefore, the first line must be as follows:
 
    <ul>
 
        <li><font class="computerfont">#1.2</font>        </li>
 
    </ul>
 
 
     </li>
 
     </li>
    <li>The second line contains numbers indicating the size of the data table that        is contained in the remainder of the file. Note that the name and        description columns are not included in the number of data columns.
+
</ul>
    <ul>
+
<p><span style="font-family: Arial;">When creating files for GSEA, do not use hyphens (-) in the file names. Due to restrictions imposed by certain Java libraries used by GSEA, the GSEA command line cannot accept file names that contain hyphens.</span></p>
        <li>Line format: (# of data rows) (tab) (# of data columns)            </li>
+
<h1>Expression Data Formats</h1>
        <li>For example: <font class="computerfont">7129 58</font>         </li>
+
<p class="MsoNormal"><strong style="">Note</strong>: The GCT &amp; RES expression formats supported by GSEA are identical to those supported by GenePattern.</p>
    </ul>
+
<h2>GCT: Gene Cluster Text file format (*.gct)</h2>
    </li>
+
<p class="MsoNormal">The GCT format is a tab delimited file format that describes an expression dataset. It is organized as follows: </p>
    <li>The third line contains a list of identifiers for the samples associated with each        of the columns in the remainder of the file.
+
<p class="MsoNormal">[[image:gct_format_snapshot.gif]]</p>
    <ul>
+
<p class="MsoNormal">The <strong style="">first line</strong> contains the version string and is always the same for this file format. Therefore, the first line must be as follows:<code><span style="font-size: 10pt;"><br />
        <li>Line format: Name (tab) Description (tab) (sample 1 name) (tab)                (sample 2 name) (tab) ... (sample N name)            </li>
+
</span></code></p>
        <li>For example: <font class="computerfont">Name Description DLBC1_1 DLBC2_1 ... DLBC58_0</font>         </li>
+
<div style="margin-left: 40px;"><code><span style="font-size: 10pt;">#1.2</span></code><br />
    </ul>
+
<code></code></div>
    </li>
+
<p class="MsoNormal">The <strong style="">second line</strong> contains numbers indicating the size of the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns. </p>
    <li>The remainder of the data file contains data for each of the genes. There        is one line for each gene and one column for each of the samples. The         first two fields in the line contain name and descriptions for the genes        (names and descriptions can contain spaces since fields are separated by        tabs). The number of lines should agree with the number of data rows        specified on line 2.
+
Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>(<code><span style="font-size: 10pt;"># of data rows) (tab) (# of data columns)</span></code><br />
    <ul>
+
<code>Example:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">7129 58</span></code><br />
        <li>Line format: (gene name) (tab) (gene description) (tab) (col 1 data)             (tab) (col 2 data) (tab) ... (col N data)        </li>
+
</code>
        <li>For example: <font class="computerfont">AFFX-BioB-5_at AFFX-BioB-5_at (endogenous            control) -104 -152 -158 ... -44</font>     </li>
+
<p class="MsoNormal"><code>The <strong style="">third line</strong> contains a list of identifiers for the samples associated with each of the columns in the remainder of the file. </code></p>
    </ul>
+
Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">Name(tab)Description(tab)(sample 1 name)(tab)(sample 2 name) (tab) ... (sample N name)</span></code><br />
    </li>
+
<code>Example:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">Name Description DLBC1_1 DLBC2_1 ... DLBC58_0</span></code><br />
</ol>
+
</code>
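<p class="MsoNormal">As a rough illustration of the GCT layout described in this section, the following Python sketch reads GCT-formatted text and checks it against the counts given on the second line. The function name <code>parse_gct</code> and the tiny three-sample dataset are made up for this example and are not part of GSEA; empty data fields are read as missing values.</p>

```python
def parse_gct(text):
    """Parse GCT-formatted text into (sample_names, rows).

    Hypothetical helper for illustration only; not part of GSEA.
    Each row is a tuple (name, description, values), where a missing
    value (an empty field between two tabs) is returned as None.
    """
    lines = text.splitlines()

    # Line 1: the version string, always "#1.2" for this format.
    if lines[0].strip() != "#1.2":
        raise ValueError("first line must be the version string #1.2")

    # Line 2: (# of data rows) (tab) (# of data columns).
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])

    # Line 3: Name (tab) Description (tab) sample names.
    header = lines[2].split("\t")
    samples = header[2:]
    if len(samples) != n_cols:
        raise ValueError("sample count does not match line 2")

    # Remaining lines: one row per gene.
    rows = []
    for line in lines[3:]:
        if not line:
            continue  # tolerate a trailing blank line
        fields = line.split("\t")
        name, description = fields[0], fields[1]
        values = [float(v) if v != "" else None for v in fields[2:]]
        if len(values) != n_cols:
            raise ValueError("wrong number of data columns for " + name)
        rows.append((name, description, values))

    if len(rows) != n_rows:
        raise ValueError("row count does not match line 2")
    return samples, rows


# A made-up dataset: 2 genes, 3 samples, one missing value for geneA.
EXAMPLE_GCT = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "geneA\tNA\t1.5\t\t-3\n"
    "geneB\tsome description\t0\t2\t4\n"
)

samples, rows = parse_gct(EXAMPLE_GCT)
print(samples)
print(rows)
```

<p class="MsoNormal">Splitting on tabs only, rather than on any whitespace, is what allows names and descriptions to contain spaces.</p>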
The main difference between RES and GCT file formats is the RES file format contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software.<br /> <hr />  <a name="res"><strong>RES File Format</strong></a><br /> This is a tab delimited file format that is organized as follows:<br />  <img alt="RES format snapshot" src="../../../images/res_format_snapshot.png" />
+
<p class="MsoNormal"> The <strong style="">remainder</strong> of the data file contains data for each of the genes. There is one row for each gene and one column for each of the samples. The number of rows and columns should agree with the number of rows and columns specified on line 2. Each row contains a name, a description, and an intensity value for each sample. Names and descriptions can contain spaces, but may not be empty. If no description is available, enter a text string such as NA or NULL. Intensity values may be missing. To specify a missing intensity value, leave the field empty: ...(tab)(tab)....&nbsp;<br />
<ol>
+
</p>
    <li>The first line contains a list of labels identifying the samples associated with        each of the columns in the remainder of the file. Two tabs (\t\t) separate the        sample identifier labels because each sample contains two data values (an        expression value and a present/marginal/absent call).
+
<p style="" class="MsoNormal">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">(gene name) (tab) (gene description) (tab) (col 1 data) (tab) (col 2 data) (tab) ... (col N data)<br />
    <ul>
+
</span></code>Example:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span> <code><span style="font-size: 10pt;">AFFX-BioB-5_at AFFX-BioB-5_at (endogenous control) -104 -152 -158 ... -44</span></code></p>
        <li>Line format: Description (tab) Accession (tab) (sample 1 name)                (tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)            </li>
+
<p style="" class="MsoNormal"><strong>Example file</strong>: [http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/dataset_files/P53_hgu95av2.gct P53_hgu95av2.gct]</p>
        <li>For example: <font class="computerfont">Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0</font>          </li>
+
<h2>RES: ExpRESsion (with P and A calls) file format (*.res) </h2>
    </ul>
+
<p class="MsoNormal">The RES file format is a tab delimited file format that describes an expression dataset. The main difference between the RES and GCT file formats is that the RES file format contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software. It is organized as follows:</p>
    </li>
+
<p class="MsoNormal">[[image:res_format_snapshot.gif]]</p>
    <li>The second line contains a list of sample descriptions. Currently,         GSEA ignores these descriptions.
+
<p class="MsoNormal">The <strong style="">first line</strong> contains a list of labels identifying the samples associated with each of the columns in the remainder of the file. Two tabs (\t\t) separate the sample identifier labels because each sample contains two data values (an expression value and a present/marginal/absent call). </p>
    <ul>
+
<p class="MsoNormal">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">Description (tab) Accession (tab) (sample 1 name) (tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)</span></code> </p>
        <li>Line format: (tab) (sample 1 description) (tab) (tab) (sample 2                description) (tab) (tab) ... (sample N description)            </li>
+
<p class="MsoNormal">For example:<span style="">&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0</span></code> </p>
        <li>For example, our RES file creation tool places the sample data file                name and scale factors in this row: <font class="computerfont">MG2000062219AA                MG2000062256AA/scale factor=1.2172 ...                MG2000062211AA/scale factor=1.1214</font>         </li>
+
<p class="MsoNormal">The <strong style="">second line</strong> contains a list of sample descriptions. Currently, GSEA ignores these descriptions. Our RES file creation tool places the sample data file name and scale factors in this row, as shown below.</p>
    </ul>
+
<p class="MsoNormal">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">(tab) (sample 1 description) (tab) (tab) (sample 2 description) (tab) (tab) ... (sample N description) </span></code></p>
    </li>
+
<p class="MsoNormal">Example:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">MG2000062219AA MG2000062256AA/scale factor=1.2172 ... MG2000062211AA/scale factor=1.1214</span></code> </p>
    <li>The third line contains a number indicating the number of rows in the data table that        is contained in the remainder of the file. Note that the name and        description columns are not included in the number of data columns.
+
<p class="MsoNormal">The <strong style="">third line</strong> contains a number indicating the number of rows in the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns. </p>
    <ul>
+
<p class="MsoNormal">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">(# of data rows)</span></code> </p>
        <li>Line format: (# of data rows)            </li>
+
<p class="MsoNormal">For example:<span style="">&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">7129</span></code> </p>
        <li>For example: <font class="computerfont">7129</font>         </li>
+
<p class="MsoNormal">The <strong style="">remainder</strong> of the data file contains data for each of the genes. There is one row for each gene and two columns for each of the samples. The first two fields in the row contain the description and name for each of the genes (names and descriptions can contain spaces since fields are separated by tabs). The description field is optional but the tab following it is not. Each sample has two pieces of data associated with it: an expression value and an associated Absent/Marginal/Present (A/M/P) call. The A/M/P calls are generated by microarray scanning software (such as Affymetrix's GeneChip software) and are an indication of the confidence in the measured expression value. Currently, GSEA ignores the Absent/Marginal/Present call. </p>
    </ul>
+
<p class="MsoNormal">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">(gene description) (tab) (gene name) (tab) (sample 1 data) (tab) (sample 1 A/P call) (tab) (sample 2 data) (tab) (sample 2 A/P call) (tab) ... (sample N data) (tab) (sample N A/P call) </span></code></p>
    </li>
+
<p class="MsoNormal">For example:<span style="">&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104 A -152 A ... -44 A</span></code> </p>
    <li>The rest of the data file contains data for each of the genes. There        is one row for each gene and two columns for each of the samples. The         first two fields in the row contain the description and name for each of the         genes (names and descriptions can contain spaces since fields are        separated by tabs). The description field is optional but the tab following it is not.        Each sample has two pieces of data associated with        it: an expression value and an associated Absent/Marginal/Present (A/M/P) call.         The A/M/P calls are generated by microarray scanning        software (such as Affymetrix's GeneChip software) and are an indication        of the confidence in the measured expression value. Currently,        GSEA ignores the Absent/Marginal/Present call.
+
<h2>PCL: Stanford cDNA file format (*.pcl) </h2>
    <ul>
+
<p class="MsoNormal">The PCL file format is a tab delimited file format that describes an expression dataset. Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. For more information, see [http://genome-www5.stanford.edu/help/formats.shtml#pcl Stanford pcl file format]. It is organized as follows:<br />
        <li>Line format: (gene description) (tab) (gene name) (tab) (sample 1            data) (tab) (sample 1 A/P call) (tab) (sample 2 data) (tab) (sample 2 A/P call)            (tab) ... (sample N data) (tab) (sample N A/P call)        </li>
+
</p>
        <li>For example: <font class="computerfont">AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104            A -152 A ... -44 A</font>     </li>
+
<p class="MsoNormal" style="">[[image:pcl_format_snapshot.gif]]<br />
    </ul>
+
</p>
    </li>
+
<h2>TXT: Text file format for expression dataset (*.txt)</h2>
</ol>
+
<p class="MsoNormal">The TXT format is a tab delimited file format that describes an expression dataset. It is organized as follows:</p>
<hr />  <a name="pcl"><strong>PCL File Format: Expression datasets</strong></a><br /> Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. This is a tab delimited file format that is organized as follows:<br /> <center><img alt="pcl format snapshot" src="../../../images/pcl_format_snapshot.png" /></center> <hr /> <a name="cls"><strong>CLS File Format: Categorical</strong></a><br /> <br /> The CLS files are text files created to load class information into GSEA. These files use spaces or tabs to separate the fields.<br /> <center><img alt="cls format snapshot" src="../../../images/cls_format_snapshot.png" /></center>
+
<p class="MsoNormal">[[image:Txt_format_snapshot.gif]]</p>
<ol>
+
<p class="MsoNormal">The <strong style="">first line</strong> contains the labels Name and Description followed by the identifiers for each sample in the dataset. '''NOTE:''' The Description column is intended to be optional, but there is currently a bug such that it is treated as required. We hope to fix this in a future release.  If you have no descriptions available, a value of NA will suffice.<br />
    <li>The first line of a CLS file contains numbers indicating the number of         samples and number of classes. The number of samples should        correspond to the number of samples in the associated RES or GCT data        file.
+
</p>
    <ul>
+
Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">Name(tab)Description(tab)(sample 1 name)(tab)(sample 2 name) (tab) ... (sample N name)</span></code><br />
        <li>Line format: (number of samples) (space) (number of classes) (space) 1            </li>
+
<code>Example:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">Name Description DLBC1_1 DLBC2_1 ... DLBC58_0</span></code><br />
        <li>For example: <font class="computerfont">58 2 1</font>         </li>
+
</code>
    </ul>
+
<p class="MsoNormal"> The <strong style="">remainder</strong> of the file contains data for each of the genes. There is one line for each gene. Each line contains the gene name, gene description, and a value for each sample in the dataset. <!-- If the first line contains the Description label, include a description for each gene. If the first line does not contain the Description label, do not include descriptions for any gene.--> <br /><br />
    </li>
+
Gene names and descriptions can contain spaces since fields are separated by tabs.&nbsp;<br />
    <li>The second line in a CLS file contains names for the class numbers. The        line should begin with a pound sign (#) followed by a space.
+
</p>
    <ul>
+
<!--
        <li>Line format: # (space) (class 0 name) (space) (class 1 name)        </li>
+
Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">(gene name) (tab) (gene description) (tab) (col 1 data) (tab) (col 2 data) (tab) ... (col N data)<br />
        <li>For example: <font class="computerfont"># cured fatal/ref</font>     </li>
+
</span></code>Example:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span> <code><span style="font-size: 10pt;">AFFX-BioB-5_at AFFX-BioB-5_at (endogenous control) -104 -152 -158 ... -44</span></code>
    </ul>
+
-->
    </li>
+
<h1>Phenotype Data Formats</h1>
    <li>The third line contains numeric class labels for each of the samples. The        number of class labels should be the same as the number of samples         specified in the first line.</li>
+
<h2>CLS: Categorical (e.g tumor vs normal) class file format (*.cls) </h2>
    <ul>
+
<p class="MsoNormal">The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. The CLS file format uses spaces or tabs to separate the fields.</p>
        <li>Line format: (sample 1 class) (space) (sample 2 class) (space) ... (sample N class)</li>
+
<p class="MsoNormal">The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes. Categorical labels define discrete phenotypes; for example, normal vs tumor. For categorical labels, the CLS file format is organized as follows:</p>
        <li>For example: <font class="computerfont">0 0 0 ... 1</font></li>
+
<p class="MsoNormal">[[image:cls_format_snapshot.png]]</p>
    </ul>
+
<p class="MsoNormal">The <strong style="">first line</strong> of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file. </p>
</ol>
+
<p class="MsoNormal">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">(number of samples) (space) (number of classes) (space) 1 </span></code></p>
<hr /> <a name="cls2"><strong>CLS File Format: Continuous</strong></a><br /> <br /> CLS files can also be used to analyze continuous profiles, such as those from a time series experiment, or to find gene sets correlated with a gene of interest (gene neighbors).<br /> <center><img alt="cls numeric snapshot" src="../../../images/cls_numeric_format_snapshot.png" /></center> <center><img alt="cls numeric snapshot" src="../../../images/cls_time_series_format_snapshot.png" /></center>
<hr />  <a name="gmx"><strong>Gene set database: GMX File Format</strong></a><br /> <br /> The GMX files contain gene sets in a simple tab-delimited text format.<br /> <center><img alt="gmx format snapshot" src="../../../images/gmx_format_snapshot.png" /></center>
<hr /> <a name="gmt"><strong>Gene set database: GMT File Format</strong></a><br /> <br /> The GMT files contain gene sets in a simple tab-delimited text format.<br /> <center><img alt="gmt format snapshot" src="../../../images/gmt_format_snapshot.png" /></center>
<hr />  <a name="grp"><strong>GRP File Format</strong></a><br /> <br /> The GRP files contain a SINGLE gene set in a simple newline-delimited text format.<br /> <center><img alt="grp format snapshot" src="../../../images/grp_format_snapshot.png" /></center>
<hr />  <a name="chip"><strong>CHIP File Format</strong></a><br /> <br /> The CHIP file contains annotation about a microarray. It should list the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results. <br /> <center><img alt="grp format snapshot" src="../../../images/rnk_format_snapshot.png" /></center>
<hr /> <a name="map"><strong>MAP File Format</strong></a><br /> <br /> The MAP file contains annotations that map probe sets between microarrays. This file is not used directly in the GSEA algorithm, but is used to generate gene sets (via chip2chip).
<br /> <center><img alt="grp format snapshot" src="../../../images/rnk_format_snapshot.png" /></center>
<hr /> <a name="rnk"><strong>RNK File Format</strong></a><br /> <br /> The RNK file contains a single, rank-ordered gene list (<em>not</em> gene set) in a simple newline-delimited text format. It is used when you have a pre-ordered ranked list that you want to analyze with GSEA. For instance, you might have used your favorite t-test-like statistic to produce a rank-ordered gene list from your dataset, which you now want to test for enrichment (note that only gene tag permutations are possible with RNK datasets). <br /> <center><img alt="rnk format snapshot" src="../../../images/rnk_format_snapshot.png" /></center>
<hr /> <a name="mdb"><strong>MDB File Format</strong></a><br /> <br /> The MDB files contain an entire gene set database. Unlike the GMT/GMX files, the MDB files are designed to contain rich annotation about a gene set. They are XML formatted. Consult the <a href="../../../doc/msigdb.dtd.txt">MSigDB Document Type Definition</a> for details about the format.<br /> <br /> <center><img alt="msigdb DTD format snapshot" src="../../../images/msigdb_dtd_snapshot.png" /></center> <br /><strong>Example of an MSigDB XML formatted file</strong><br /> <center><img alt="msigdb xml format snapshot" src="../../../images/msigdb_xml_snapshot.png" /></center>
+
<p class="MsoNormal">Example:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">58 2 1 </span></code></p>
 +
<p class="MsoNormal">The <strong style="">second line</strong> in a CLS file contains a user-visible name for each class. These are the class names that appear in analysis reports. The line should begin with a pound sign (#) followed by a space. </p>
 +
<p class="MsoNormal">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;"># (space) (class 0 name) (space) (class 1 name)</span></code> </p>
 +
<p class="MsoNormal">Example:<span style="">&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;"># cured fatal/ref</span></code> </p>
 +
<p class="MsoNormal">The <strong style="">third line</strong> contains a class label for each sample. The class label can be the class name, a number, or a text string. The first label used is assigned to the first class named on the second line; the second unique label is assigned to the second class named; and so on.  The number of class labels specified on this line should be the same as the number of samples specified in the first line.  The number of unique class labels specified on this line should be the same as the number of classes specified in the first line.<br />
 +
</p>
 +
<p class="MsoNormal"><span style="font-weight: bold;">Note: </span>The order of the labels on the third line determines the association of class names and class labels, even if the class labels are the same as the class names and even if the labels are numbers.  The key point is that GSEA processes the third line left to right: it takes the first label it finds <span style="font-weight: bold;">no matter what it is</span> and maps it to the first class name on the second line (also read left to right).  Any other instances of that label then map to that same name.
 +
 
 +
After that, the second label found (on the third line) <span style="font-weight: bold;">different from the first</span> is mapped to the second name (on the second line), and likewise for any other instances.  If there are more unique labels than there are names then you'll get an error.
 +
 
 +
Since the third line represents your samples column-wise as they appear in the expression dataset, you need to arrange the class names on the second line in the order in which they're first encountered among your samples.  If you're also using numbers for labels, then you should encounter [0, 1, 2, ...] in order on the third line when reading left-to-right.
 +
<br />
 +
</p>
 +
<p class="MsoNormal">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">(sample 1 class) (space) (sample 2 class) (space) ... (sample N class)</span></code></p>
 +
<p class="MsoNormal">Example:<span style="">&nbsp;&nbsp;&nbsp; </span><code><span style="font-size: 10pt;">0 0 0 ... 1 1</span></code></p>
 +
<p class="MsoNormal"><strong>Example file</strong>: [http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/dataset_files/P53.cls P53.cls]</p>
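<p class="MsoNormal">The label-to-name mapping rule for categorical CLS files can be sketched in Python as follows. This is only an illustration of the rule as described in this section, not GSEA's own code; the function name <code>parse_cls</code> and the six-sample example are invented for the purpose.</p>

```python
def parse_cls(text):
    """Return the class name assigned to each sample of a categorical CLS file.

    Hypothetical helper for illustration only; not part of GSEA.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]

    # Line 1: (number of samples) (number of classes) 1
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])

    # Line 2: "# " followed by one user-visible name per class.
    if not lines[1].startswith("#"):
        raise ValueError("second line must begin with #")
    class_names = lines[1][1:].split()
    if len(class_names) != n_classes:
        raise ValueError("number of class names does not match line 1")

    # Line 3: one label per sample, separated by spaces or tabs.
    labels = lines[2].split()
    if len(labels) != n_samples:
        raise ValueError("number of labels does not match line 1")

    # The first distinct label seen (reading left to right) maps to the
    # first class name, the second distinct label to the second name, etc.
    label_to_name = {}
    for label in labels:
        if label not in label_to_name:
            if len(label_to_name) >= len(class_names):
                raise ValueError("more unique labels than class names")
            label_to_name[label] = class_names[len(label_to_name)]

    if len(label_to_name) != n_classes:
        raise ValueError("number of unique labels does not match line 1")
    return [label_to_name[label] for label in labels]


# Made-up example: the label 1 appears first, so it maps to "cured",
# even though a reader might expect 0 to be the first class.
EXAMPLE_CLS = "6 2 1\n# cured fatal\n1 1 0 0 1 0\n"

print(parse_cls(EXAMPLE_CLS))
```

<p class="MsoNormal">In this example the numeric labels do not appear in the order 0, 1, so the label 1 is associated with the first class name. Arranging the class names in the order in which their labels are first encountered avoids this kind of surprise.</p>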
 +
<h2>CLS: Continuous (e.g time-series or gene profile) file format (*.cls) </h2>
 +
<p class="MsoNormal">The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. The CLS file format uses spaces or tabs to separate the fields.</p>
 +
<p class="MsoNormal">The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes. Continuous phenotypes are used for time series experiments or to find gene sets correlated with a gene of interest (gene neighbors). A CLS file for continuous labels can contain one or more labels. The following example shows a CLS file that defines two continuous labels:<br />
 +
<br />
 +
</p>
 +
<p class="MsoNormal"><span style="font-family: courier new;">#numeric</span><br style="font-family: courier new;" />
 +
<span style="font-family: courier new;">#AFFX-BioB-5_st</span><br style="font-family: courier new;" />
 +
<span style="font-family: courier new;">206.0 31.0 252.0 -20.0 -169.0 -66.0 230.0 -23.0 67.0 173.0 -55.0 -20.0 469.0 -201.0 -117.0 -162.0 -5.0 -86.0 350.0 74.0 -215.0 193.0 506.0 183.0 350.0 113.0 -17.0 29.0 247.0 -131.0 358.0 561.0 24.0 524.0 167.0 -56.0 176.0 320.0</span><br style="font-family: courier new;" />
 +
<span style="font-family: courier new;">#AFFX-BioDn-5</span><br style="font-family: courier new;" />
 +
<span style="font-family: courier new;">75.0 142.0 32.0 109.0 -38.0 -80.0 62.0 39.0 196.0 -42.0 199.0 49.0 171.0 327.0 115.0 -71.0 85.0 80.0 270.0 182.0 208.0 -94.0 292.0 233.0 34.0 0.0 59.0 233.0 48.0 466.0 -7.0 -96.0 297.0 38.0 208.0 -15.0 30.0 357.0</span><br />
 +
</p>
 +
<br />
 +
The <span style="font-weight: bold;">first&nbsp; </span>line contains the text &quot;#numeric&quot; which indicates that the file defines continuous labels.<br />
 +
The <span style="font-weight: bold;">remainder </span>of the file defines the continuous phenotypes. For each phenotype:
 +
<ul>
 +
    <li>The <span style="font-weight: bold;">first </span>line defines the name of the phenotype; for example, #AFFX-BioB-5_st.</li>
 +
    <li>The <span style="font-weight: bold;">second </span>line contains a value for each sample in the .gct file. Typically, your word processor wraps the second line of the phenotype definition, as shown in the example.</li>
 +
</ul>
 +
<br />
 +
<span style="font-size: 9pt; font-family: Arial;">For a continuous phenotype label, the values for the samples define the phenotype profile</span>. The relative change in the values defines the relative distance between points in the phenotype profile. In the example shown above, the sample values for the two phenotype labels are gene expression values. <span style="font-size: 9pt; font-family: Arial;">In that case, the phenotype profile is the expression profile for a gene and is used to find gene sets correlated with that gene. For a time series experiment, you would choose sample values </span>that define the desired expression profile. The example shown below assumes that you have five samples taken at 30-minute intervals. The first phenotype label defines a phenotype profile that shows steadily increasing gene expression; the second defines a profile that shows an initial peak and then a gradual decrease:
 +
<p class="MsoListContinue2" style="font-family: courier new;">#numeric<br />
 +
<span>#IncreasingProfile</span><br />
 +
30 60 90 120 150<br />
 +
<span>#PeakProfile</span><br />
 +
5 20 15 10 5</p>
 +
<h1>Gene Set Database Formats</h1>
 +
<p class="MsoNormal"><strong style="">Note</strong>: Typically, you use the GMX or GMT formats to define gene sets.<br />
 +
</p>
 +
<h2>GMX: Gene MatriX file format (*.gmx)</h2>
 +
<p class="MsoNormal">The GMX file format is a tab delimited file format that describes gene sets. In the GMX format, each column represents a gene set; in the GMT format, each row represents a gene set. The GMX file format is organized as follows:</p>
 +
<p class="MsoNormal">[[image:gmx_format_snapshot.gif]]</p>
 +
<p class="MsoNormal">Each gene set is described by a name, a description, and the genes in the gene set. GSEA uses the description field to determine what hyperlink to provide in the report for the gene set description: if the description is &ldquo;na&rdquo;, GSEA provides a link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.</p>
 +
<h2>GMT: Gene Matrix Transposed file format (*.gmt)</h2>
 +
<p class="MsoNormal">The GMT file format is a tab delimited file format that describes gene sets. In the GMT format, each row represents a gene set; in the GMX format, each column represents a gene set. The GMT file format is organized as follows:</p>
 +
<p class="MsoNormal">[[image:gmt_format_snapshot.gif]]</p>
 +
<p class="MsoNormal">Each gene set is described by a name, a description, and the genes in the gene set. GSEA uses the description field to determine what hyperlink to provide in the report for the gene set description: if the description is &ldquo;na&rdquo;, GSEA provides a link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.</p>
 +
<h2>GRP: Gene set file format (*.grp)</h2>
 +
<p class="MsoNormal">The GRP files contain a single gene set in a simple newline-delimited text format. Typically, you use the GMT or GMX file formats to create gene sets, rather than using the GRP file format. The GRP file format is organized as follows:</p>
 +
<p class="MsoNormal">[[image:grp_format_snapshot.gif]]</p>
 +
<h2>XML: Molecular signature database file format (msigdb_*.xml)</h2>
 +
<p class="MsoNormal" style="margin-bottom: 12pt;">The MDB files contain an entire gene set database. Unlike the GMT/GMX files, the MDB files are designed to hold rich annotation about each gene set. They are XML-formatted files based on the MSigDB Document Type Definition (DTD). The MSigDB DTD and a sample MDB file based on that DTD follow.</p>
 +
<p class="MsoNormal"><strong>MSigDB DTD:</strong></p>
 +
<p class="MsoNormal">[[image:msigdb_dtd_snapshot.gif]]</p>
 +
<p class="MsoNormal"><strong>Example of an MSigDB xml formatted file:</strong></p>
 +
<p class="MsoNormal">[[image:msigdb_xml_snapshot.gif]]</p>
 +
<h1>Microarray Chip Annotation Formats</h1>
 +
<h2>CHIP: Chip file format (*.chip) </h2>
 +
<p class="MsoNormal">The CHIP file contains annotation about a microarray. It should list the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results and may also be used to collapse each probe set in the expression dataset to a single gene vector.</p>
 +
<p class="MsoNormal">The CHIP file format is organized as follows:</p><br/>
 +
<p class="MsoNormal">[[image:chip_format_snapshot.gif]]</p><br/>
 +
<p class="MsoNormal">The file name must end with .chip extension.</p>
 +
<p class="MsoNormal">The <strong style="">first line</strong> contains column headings that identify the content of each column in the remainder of the file. The file must contain three column headings separated by tabs:
 +
<li>Probe Set ID
 +
<li>Gene Symbol
 +
<li>Gene Title
 +
</p>
 +
<p class="MsoNormal">The GENE_SYMBOL.chip file contains one additional column, Aliases, which is not shown here. When a gene is identified by more than one HUGO gene symbol, the Gene Symbol column contains the gene symbol that appears in the GSEA reports and the Aliases column identifies other gene symbols used to reference the same gene. If a gene set or chip annotation file contains a gene in the Aliases column, GSEA automatically converts it to the gene in the Gene Symbol column.</p>
 +
<p class="MsoNormal">The <strong style="">rest of the file</strong> contains data for each probe set ID used in the microarray.</p>
 +
<p class="MsoNormal">Line format: (probe set id) (tab) (gene symbol) (tab) (gene title)</p>
 +
 
 +
<h1>Ranked Gene Lists</h1>
 +
<h2>RNK: Ranked list file format (*.rnk)</h2>
 +
<p class="MsoNormal">The RNK file contains a single, rank-ordered gene list (<em>not</em> gene set) in a tab-delimited text format with each gene on a new line. It is used when you have a pre-ranked list that you want to analyze with GSEA. For instance, you might have used a t-test-like statistic to produce a rank-ordered gene list from your dataset, which you now want to test for enrichment. The order of the lines does not matter. The second column must, however, contain numeric values, because GSEA uses them to rank-order the genes.</p>
 +
<p class="MsoNormal">[[image:rnk_format_snapshot.gif]]</p>

Latest revision as of 13:17, 15 December 2020


Each GSEA supported file is an ASCII text file with a specific format, as described below. For sample data sets, click here.

To create and edit GSEA files, use Excel or a text editor. If you are using Excel:

  • Be aware that Excel's auto-formatting can introduce errors in gene names, as described in Zeeberg, et al 2004.
  • To create a tab-delimited text file: select File>Save As, enter the file name in quotes to preserve the file extension (for example, "p53.gct"), and select "Text(Tab delimited)(*.txt)" as the file type. Excel displays a message warning you that your file may contain features that are not compatible with this format and asks if you want to keep the workbook in this format. Click Yes to keep this format. Your file has now been saved. Exit from Excel. When Excel asks if you want to save your changes to this file, select No (you have already saved the file).

When creating files for GSEA, do not use hyphens (-) in the file names. Due to restrictions imposed by certain Java libraries used by GSEA, the GSEA command line cannot accept file names that contain hyphens.

Expression Data Formats

Note: The GCT & RES expression formats supported by GSEA are identical to those supported by GenePattern.

GCT: Gene Cluster Text file format (*.gct)

The GCT format is a tab delimited file format that describes an expression dataset. It is organized as follows:

Gct format snapshot.gif

The first line contains the version string and is always the same for this file format. Therefore, the first line must be as follows:

#1.2

The second line contains numbers indicating the size of the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.

Line format:        (# of data rows) (tab) (# of data columns)
Example:            7129 58

The third line contains a list of identifiers for the samples associated with each of the columns in the remainder of the file.

Line format:        Name(tab)Description(tab)(sample 1 name)(tab)(sample 2 name) (tab) ... (sample N name)
Example:            Name Description DLBC1_1 DLBC2_1 ... DLBC58_0

The remainder of the data file contains data for each of the genes. There is one row for each gene and one column for each of the samples. The number of rows and columns should agree with the number of rows and columns specified on line 2. Each row contains a name, a description, and an intensity value for each sample. Names and descriptions can contain spaces, but may not be empty. If no description is available, enter a text string such as NA or NULL. Intensity values may be missing. To specify a missing intensity value, leave the field empty: ...(tab)(tab).... 

Line format:        (gene name) (tab) (gene description) (tab) (col 1 data) (tab) (col 2 data) (tab) ... (col N data)
Example:            AFFX-BioB-5_at AFFX-BioB-5_at (endogenous control) -104 -152 -158 ... -44

Example file: P53_hgu95av2.gct
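
The GCT layout described above can be checked programmatically. The following is a minimal Python sketch, not part of GSEA: the function name parse_gct and the inline sample data are hypothetical, and it assumes the whole file fits in memory. Empty fields are read as missing intensity values.

```python
def parse_gct(text):
    """Parse GCT text into (sample_names, rows).

    Each row is (name, description, values); a missing value is None.
    """
    lines = text.splitlines()
    if lines[0].strip() != "#1.2":
        raise ValueError("line 1 must be the version string #1.2")
    # Line 2: number of data rows, then number of data columns.
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])
    # Line 3: Name, Description, then one identifier per sample.
    samples = [s.strip() for s in lines[2].split("\t")[2:]]
    if len(samples) != n_cols:
        raise ValueError("line 3 does not list the declared number of samples")
    rows = []
    for line in lines[3:]:
        if not line.strip():
            continue  # ignore blank lines at the end of the file
        fields = line.split("\t")
        values = [float(v) if v.strip() else None for v in fields[2:]]
        if len(values) != n_cols:
            raise ValueError(f"row {fields[0]!r} has {len(values)} values")
        rows.append((fields[0], fields[1], values))
    if len(rows) != n_rows:
        raise ValueError("row count does not match the count on line 2")
    return samples, rows


# Hypothetical two-gene, three-sample dataset with one missing value.
sample_gct = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "G1\tNA\t1.5\t\t-3\n"
    "G2\tNA\t0\t2\t4\n"
)
gct_samples, gct_rows = parse_gct(sample_gct)
print(gct_samples)
print(gct_rows[0])
```

Lines are split on tabs only and are not stripped first, because trailing empty fields are meaningful in this format.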

RES: ExpRESsion (with P and A calls) file format (*.res)

The RES file format is a tab delimited file format that describes an expression dataset. The main difference between the RES and GCT file formats is that the RES file format contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software. It is organized as follows:

Res format snapshot.gif

The first line contains a list of labels identifying the samples associated with each of the columns in the remainder of the file. Two tabs (\t\t) separate the sample identifier labels because each sample contains two data values (an expression value and a present/marginal/absent call).

Line format:      Description (tab) Accession (tab) (sample 1 name) (tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)

For example:    Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0

The second line contains a list of sample descriptions. Currently, GSEA ignores these descriptions. Our RES file creation tool places the sample data file name and scale factors in this row, as shown below.

Line format:      (tab) (sample 1 description) (tab) (tab) (sample 2 description) (tab) (tab) ... (sample N description)

Example:          MG2000062219AA MG2000062256AA/scale factor=1.2172 ... MG2000062211AA/scale factor=1.1214

The third line contains a number indicating the number of rows in the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.

Line format:      (# of data rows)

For example:    7129

The remainder of the data file contains data for each of the genes. There is one row for each gene and two columns for each of the samples. The first two fields in the row contain the description and name for each of the genes (names and descriptions can contain spaces since fields are separated by tabs). The description field is optional but the tab following it is not. Each sample has two pieces of data associated with it: an expression value and an associated Absent/Marginal/Present (A/M/P) call. The A/M/P calls are generated by microarray scanning software (such as Affymetrix's GeneChip software) and are an indication of the confidence in the measured expression value. Currently, GSEA ignores the Absent/Marginal/Present call.

Line format:      (gene description) (tab) (gene name) (tab) (sample 1 data) (tab) (sample 1 A/P call) (tab) (sample 2 data) (tab) (sample 2 A/P call) (tab) ... (sample N data) (tab) (sample N A/P call)

For example:    AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104 A -152 A ... -44 A
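
Because each sample occupies two fields in a RES data row, the expression values and the calls can be separated by taking alternate fields. This is a minimal Python sketch under the assumption that every sample has both fields present; the function name parse_res_row is hypothetical.

```python
def parse_res_row(line):
    """Split one RES data row into (name, description, values, calls)."""
    fields = line.rstrip("\n").split("\t")
    # RES rows start with the description, followed by the gene name.
    description, name = fields[0], fields[1]
    per_sample = fields[2:]
    values = [float(v) for v in per_sample[0::2]]  # expression values
    calls = per_sample[1::2]                       # A/M/P calls
    return name, description, values, calls


res_row = "AFFX-BioB-5_at (endogenous control)\tAFFX-BioB-5_at\t-104\tA\t-152\tA\t-44\tA"
res_name, res_description, res_values, res_calls = parse_res_row(res_row)
print(res_name, res_values, res_calls)
```

The calls are returned only for completeness; as noted above, GSEA currently ignores them.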

PCL: Stanford cDNA file format (*.pcl)

The PCL file format is a tab delimited file format that describes an expression dataset. Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. For more information, see Stanford pcl file format. The PCL file format is organized as follows:

Pcl format snapshot.gif

TXT: Text file format for expression dataset (*.txt)

The TXT format is a tab delimited file format that describes an expression dataset. It is organized as follows:

Txt format snapshot.gif

The first line contains the labels Name and Description followed by the identifiers for each sample in the dataset. NOTE: The Description column is intended to be optional, but there is currently a bug such that it is treated as required. We hope to fix this in a future release. If you have no descriptions available, a value of NA will suffice.

Line format:        Name(tab)Description(tab)(sample 1 name)(tab)(sample 2 name) (tab) ... (sample N name)
Example:            Name Description DLBC1_1 DLBC2_1 ... DLBC58_0

The remainder of the file contains data for each of the genes. There is one line for each gene. Each line contains the gene name, gene description, and a value for each sample in the dataset.

Gene names and descriptions can contain spaces since fields are separated by tabs. 

Phenotype Data Formats

CLS: Categorical (e.g tumor vs normal) class file format (*.cls)

The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. The CLS file format uses spaces or tabs to separate the fields.

The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes. Categorical labels define discrete phenotypes; for example, normal vs tumor. For categorical labels, the CLS file format is organized as follows:

Cls format snapshot.png

The first line of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file.

Line format:      (number of samples) (space) (number of classes) (space) 1

Example:          58 2 1

The second line in a CLS file contains a user-visible name for each class. These are the class names that appear in analysis reports. The line should begin with a pound sign (#) followed by a space.

Line format:      # (space) (class 0 name) (space) (class 1 name)

Example:    # cured fatal/ref

The third line contains a class label for each sample. The class label can be the class name, a number, or a text string. The first label used is assigned to the first class named on the second line; the second unique label is assigned to the second class named; and so on. The number of class labels specified on this line should be the same as the number of samples specified in the first line. The number of unique class labels specified on this line should be the same as the number of classes specified in the first line.

Note: The order of the labels on the third line determines the association of class names and class labels, even if the class labels are the same as the class names and even if the labels are numbers. The key point is that as the third line is processed left-to-right, it will take the first label it finds no matter what it is and map it to the first class name from the second line (also left-to-right). Any other instances of that label then map to that same name. After that, the second label found (on the third line) different from the first is mapped to the second name (on the second line), and likewise for any other instances. If there are more unique labels than there are names then you'll get an error. Since the third line represents your samples column-wise as they appear in the expression dataset, you need to arrange the class names on the second line in the order in which they're first encountered among your samples. If you're also using numbers for labels, then you should encounter [0, 1, 2, ...] in order on the third line when reading left-to-right.

Line format:      (sample 1 class) (space) (sample 2 class) (space) ... (sample N class)

Example:    0 0 0 ... 1 1

Example file: P53.cls
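
The first-encounter mapping rule in the note above is easy to get wrong, so a small illustration may help. This is a minimal Python sketch, not GSEA's own code: the function name parse_categorical_cls and the sample data are hypothetical, and it assumes the second line starts with "# " as described.

```python
def parse_categorical_cls(text):
    """Return (per_sample_class_names, label_to_name) for a categorical CLS."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    names = lines[1].split()[1:]  # drop the leading '#'
    labels = lines[2].split()
    if len(labels) != n_samples:
        raise ValueError("number of labels differs from the declared sample count")
    label_to_name = {}
    for label in labels:  # processed left to right
        if label not in label_to_name:
            if len(label_to_name) >= len(names):
                raise ValueError("more unique labels than class names")
            # The n-th new label gets the n-th class name, whatever the label is.
            label_to_name[label] = names[len(label_to_name)]
    if len(label_to_name) != n_classes:
        raise ValueError("number of unique labels differs from the declared class count")
    return [label_to_name[label] for label in labels], label_to_name


# Label 1 appears first, so it maps to the first class name, "tumor".
cls_text = "4 2 1\n# tumor normal\n1 1 0 0\n"
cls_per_sample, cls_mapping = parse_categorical_cls(cls_text)
print(cls_mapping)
print(cls_per_sample)
```

In this example the numeric label 1 is mapped to the first class name even though 0 might be expected to come first, which is the behavior the note warns about.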

CLS: Continuous (e.g time-series or gene profile) file format (*.cls)

The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. The CLS file format uses spaces or tabs to separate the fields.

The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes. Continuous phenotypes are used for time series experiments or to find gene sets correlated with a gene of interest (gene neighbors). A CLS file for continuous labels can contain one or more labels. The following example shows a CLS file that defines two continuous labels:

#numeric
#AFFX-BioB-5_st
206.0 31.0 252.0 -20.0 -169.0 -66.0 230.0 -23.0 67.0 173.0 -55.0 -20.0 469.0 -201.0 -117.0 -162.0 -5.0 -86.0 350.0 74.0 -215.0 193.0 506.0 183.0 350.0 113.0 -17.0 29.0 247.0 -131.0 358.0 561.0 24.0 524.0 167.0 -56.0 176.0 320.0
#AFFX-BioDn-5
75.0 142.0 32.0 109.0 -38.0 -80.0 62.0 39.0 196.0 -42.0 199.0 49.0 171.0 327.0 115.0 -71.0 85.0 80.0 270.0 182.0 208.0 -94.0 292.0 233.0 34.0 0.0 59.0 233.0 48.0 466.0 -7.0 -96.0 297.0 38.0 208.0 -15.0 30.0 357.0


The first  line contains the text "#numeric" which indicates that the file defines continuous labels.
The remainder of the file defines the continuous phenotypes. For each phenotype:

  • The first line defines the name of the phenotype; for example, #AFFX-BioB-5_st.
  • The second line contains a value for each sample in the .gct file. Typically, your word processor wraps the second line of the phenotype definition, as shown in the example.


For a continuous phenotype label, the values for the samples define the phenotype profile. The relative change in the values defines the relative distance between points in the phenotype profile. In the example shown above, the sample values for the two phenotype labels are gene expression values. In that case, the phenotype profile is the expression profile for a gene and is used to find gene sets correlated with that gene. For a time series experiment, you would choose sample values that define the desired expression profile. The example shown below assumes that you have five samples taken at 30-minute intervals. The first phenotype label defines a phenotype profile that shows steadily increasing gene expression; the second defines a profile that shows an initial peak and then a gradual decrease:

#numeric
#IncreasingProfile
30 60 90 120 150
#PeakProfile
5 20 15 10 5
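
A continuous CLS file like the time-series example can be read into named profiles as follows. This is a minimal Python sketch; the function name parse_continuous_cls is hypothetical, and it assumes that any value line belongs to the most recent phenotype name line, so a value list wrapped over several lines is joined back together.

```python
def parse_continuous_cls(text):
    """Return {phenotype_name: [values]} for a continuous CLS file."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines[0] != "#numeric":
        raise ValueError("a continuous CLS file must start with #numeric")
    profiles = {}
    current = None
    for line in lines[1:]:
        if line.startswith("#"):
            current = line[1:].strip()  # phenotype name without the '#'
            profiles[current] = []
        elif current is None:
            raise ValueError("values found before any phenotype name")
        else:
            profiles[current].extend(float(v) for v in line.split())
    return profiles


cls_numeric = "#numeric\n#IncreasingProfile\n30 60 90 120 150\n#PeakProfile\n5 20 15 10 5\n"
cls_profiles = parse_continuous_cls(cls_numeric)
print(cls_profiles)
```

Each profile should end up with one value per sample in the expression dataset; that check is left to the caller because it needs the dataset's sample count.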

Gene Set Database Formats

Note: Typically, you use the GMX or GMT formats to define gene sets.

GMX: Gene MatriX file format (*.gmx)

The GMX file format is a tab delimited file format that describes gene sets. In the GMX format, each column represents a gene set; in the GMT format, each row represents a gene set. The GMX file format is organized as follows:

Gmx format snapshot.gif

Each gene set is described by a name, a description, and the genes in the gene set. GSEA uses the description field to determine what hyperlink to provide in the report for the gene set description: if the description is “na”, GSEA provides a link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.

GMT: Gene Matrix Transposed file format (*.gmt)

The GMT file format is a tab delimited file format that describes gene sets. In the GMT format, each row represents a gene set; in the GMX format, each column represents a gene set. The GMT file format is organized as follows:

Gmt format snapshot.gif

Each gene set is described by a name, a description, and the genes in the gene set. GSEA uses the description field to determine what hyperlink to provide in the report for the gene set description: if the description is “na”, GSEA provides a link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.
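
Since GMX and GMT hold the same information in transposed layouts, one can be converted to the other. This is a minimal Python sketch, assuming that each GMT row (and each GMX column) lists the name first, then the description, then the genes; the function names gmx_to_gmt and parse_gmt and the sample gene sets are hypothetical.

```python
def gmx_to_gmt(text):
    """Transpose GMX text (one gene set per column) into GMT text (one per row)."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    width = max(len(row) for row in rows)
    # Pad short rows so every column has a cell on every line.
    padded = [row + [""] * (width - len(row)) for row in rows]
    out = []
    for column in zip(*padded):
        name, description, *genes = column
        genes = [g for g in genes if g.strip()]  # drop padding cells
        out.append("\t".join([name, description, *genes]))
    return "\n".join(out)


def parse_gmt(text):
    """Return {gene_set_name: (description, [genes])} from GMT text."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        gene_sets[fields[0]] = (fields[1], [g for g in fields[2:] if g.strip()])
    return gene_sets


# Hypothetical GMX with two gene sets of different sizes.
gmx_text = "SET_A\tSET_B\nna\tna\nTP53\tBRCA1\nMDM2\t\n"
gmt_text = gmx_to_gmt(gmx_text)
gmt_sets = parse_gmt(gmt_text)
print(gmt_text)
```

Gene sets usually differ in length, so the shorter columns of a GMX file contain empty cells; the sketch discards those rather than treating them as gene names.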

GRP: Gene set file format (*.grp)

The GRP files contain a single gene set in a simple newline-delimited text format. Typically, you use the GMT or GMX file formats to create gene sets, rather than using the GRP file format. The GRP file format is organized as follows:

Grp format snapshot.gif

XML: Molecular signature database file format (msigdb_*.xml)

The MDB files contain an entire gene set database. Unlike the GMT/GMX files, the MDB files are designed to hold rich annotation about each gene set. They are XML-formatted files based on the MSigDB Document Type Definition (DTD). The MSigDB DTD and a sample MDB file based on that DTD follow.

MSigDB DTD:

Msigdb dtd snapshot.gif

Example of an MSigDB xml formatted file:

Msigdb xml snapshot.gif

Microarray Chip Annotation Formats

CHIP: Chip file format (*.chip)

The CHIP file contains annotation about a microarray. It should list the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results and may also be used to collapse each probe set in the expression dataset to a single gene vector.

The CHIP file format is organized as follows:


Chip format snapshot.gif


The file name must end with .chip extension.

The first line contains column headings that identify the content of each column in the remainder of the file. The file must contain three column headings separated by tabs:

  • Probe Set ID
  • Gene Symbol
  • Gene Title

    The GENE_SYMBOL.chip file contains one additional column, Aliases, which is not shown here. When a gene is identified by more than one HUGO gene symbol, the Gene Symbol column contains the gene symbol that appears in the GSEA reports and the Aliases column identifies other gene symbols used to reference the same gene. If a gene set or chip annotation file contains a gene in the Aliases column, GSEA automatically converts it to the gene in the Gene Symbol column.

    The rest of the file contains data for each probe set ID used in the microarray.

    Line format: (probe set id) (tab) (gene symbol) (tab) (gene title)
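
To illustrate how a CHIP file can drive the collapse of probe sets to genes, here is a minimal Python sketch. The function names parse_chip and collapse_to_genes, the probe IDs, and the gene symbols are hypothetical. Taking the per-sample maximum across probes is only one possible collapsing rule, chosen here for illustration; this page does not specify the rule GSEA applies.

```python
def parse_chip(text):
    """Return {probe_set_id: (gene_symbol, gene_title)} from CHIP text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    chip = {}
    for line in lines[1:]:  # lines[0] holds the column headings
        fields = [f.strip() for f in line.split("\t")]
        fields += [""] * (3 - len(fields))  # tolerate missing trailing columns
        probe, symbol, title = fields[:3]
        chip[probe] = (symbol, title)
    return chip


def collapse_to_genes(expression, chip):
    """Collapse {probe_id: [values]} to {gene_symbol: [values]}.

    Assumption for this sketch: probes sharing a symbol are merged by
    taking the maximum value per sample; unmapped probes are dropped.
    """
    collapsed = {}
    for probe, values in expression.items():
        symbol = chip.get(probe, ("", ""))[0]
        if not symbol:
            continue
        if symbol in collapsed:
            collapsed[symbol] = [max(a, b) for a, b in zip(collapsed[symbol], values)]
        else:
            collapsed[symbol] = list(values)
    return collapsed


chip_text = (
    "Probe Set ID\tGene Symbol\tGene Title\n"
    "probe_1\tGENE_A\tfirst example gene\n"
    "probe_2\tGENE_B\tsecond example gene\n"
    "probe_3\tGENE_A\tfirst example gene\n"
)
chip_map = parse_chip(chip_text)
expression = {
    "probe_1": [1.0, 5.0],
    "probe_2": [2.0, 2.0],
    "probe_3": [3.0, 4.0],
    "probe_9": [9.0, 9.0],  # not listed in the CHIP file
}
chip_collapsed = collapse_to_genes(expression, chip_map)
print(chip_collapsed)
```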

    Ranked Gene Lists

    RNK: Ranked list file format (*.rnk)

    The RNK file contains a single, rank-ordered gene list (not gene set) in a tab-delimited text format with each gene on a new line. It is used when you have a pre-ranked list that you want to analyze with GSEA. For instance, you might have used a t-test-like statistic to produce a rank-ordered gene list from your dataset, which you now want to test for enrichment. The order of the lines does not matter. The second column must, however, contain numeric values, because GSEA uses them to rank-order the genes.

    Rnk format snapshot.gif
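
Because the line order does not matter and the second column carries the ranking metric, reading an RNK file amounts to parsing pairs and sorting them. This is a minimal Python sketch; the function name parse_rnk and the gene names are hypothetical, and sorting from the highest to the lowest score is an assumption made for illustration.

```python
def parse_rnk(text):
    """Return [(gene, score)] sorted by score, highest first."""
    ranked = []
    for line in text.splitlines():
        if not line.strip():
            continue
        gene, score = line.split("\t")[:2]
        # float() raises ValueError if the second column is not numeric.
        ranked.append((gene.strip(), float(score)))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


rnk_text = "GENE_B\t-1.5\nGENE_A\t2.0\nGENE_C\t0.3\n"
rnk_ranked = parse_rnk(rnk_text)
print(rnk_ranked)
```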