Difference between revisions of "Data formats"

From GeneSetEnrichmentAnalysisWiki
<br />
+
<h1><a name="_Toc127331825">Data Formats Supported by GSEA</a></h1>
<h1 class="news">&nbsp;Data formats supported by GSEA</h1>
+
<p><span style="">For sample files, see [http://wwwdev.broad.mit.edu/gsea/resources/datasets_index.html http://wwwdev.broad.mit.edu/gsea/resources/datasets_index.html] or, from within GSEA, select </span><span style=""><code><span style="font-size: 10pt;">Help&gt;Show GSEA Home Folder</span></code> and go to the </span><span style=""><code><span style="font-size: 10pt;">examples</span></code> subfolder.</span></p>
<img src="../../../images/input_file_formats.png" alt="GSEA input file formats snapshot" /><br />You can download example files from [../resources/datasets_index.html here].<br />
+
<p class="MsoNormal"><span style=""><strong style=""><span style="font-size: 14pt;">Expression data formats<o:p></o:p></span></strong></span></p>
<h2>Expression data formats</h2>
+
<p class="MsoNormal"><span style=""></span><a href="#_GCT_File_Format"><span style="">GCT: Gene Cluster Text file format (*.gct)</span><span style=""></span></a><span style=""></span></p>
<ol>
+
<p class="MsoNormal"><span style=""></span><a href="#_RES_File_Format"><span style="">RES: ExpRESsion (with P and A calls) file format (*.res)</span><span style=""></span></a><span style=""> </span></p>
    <li><a href="../../../doc/data_formats.html#gct">GCT: Gene Cluster Text file format (*.gct)</a>         </li>
+
<p class="MsoNormal"><span style=""></span><a href="#_PCL_File_Format"><span style="">PCL: Stanford cDNA file format (*.pcl)</span><span style=""></span></a><span style=""> </span></p>
    <li><a href="../../../doc/data_formats.html#res">RES: ExpRESsion (with P and A calls) file format (*.res)</a>         </li>
+
<p class="MsoNormal"><span style=""><strong style="">Note</strong>: The GCT &amp; RES expression formats supported by GSEA are identical to those supported by GenePattern.</span></p>
    <li><a href="../../../doc/data_formats.html#pcl">PCL: Stanford cDNA file format (*.pcl)</a>         </li>
+
<p class="MsoNormal"><span style=""><strong style=""><span style="font-size: 14pt;">Phenotype data formats<o:p></o:p></span></strong></span></p>
</ol>
+
<p class="MsoNormal"><span style=""></span><a href="#_CLS_File_Format:_Categorical"><span style="">CLS: Categorical (e.g., tumor vs. normal) class file format (*.cls)</span><span style=""></span></a><span style=""> </span></p>
<h2>Phenotype data formats</h2>
+
<p class="MsoNormal"><span style=""></span><a href="#_CLS_File_Format:_Continuous"><span style="">CLS: Continuous (e.g., time-series or gene profile) file format (*.cls)</span><span style=""></span></a><span style=""> </span></p>
<ol>
+
<p class="MsoNormal"><span style=""><strong style=""><span style="font-size: 14pt;">Gene set database formats<o:p></o:p></span></strong></span></p>
    <li><a href="../../../doc/data_formats.html#cls">CLS: Categorical (e.g., tumor vs. normal) class file format (*.cls)</a>         </li>
+
<p class="MsoNormal"><span style=""></span><a href="#_GMX_File_Format"><span style="">GMX: Gene MatriX file format (*.gmx)</span><span style=""></span></a><span style=""></span></p>
    <li><a href="../../../doc/data_formats.html#cls2">CLS: Continuous (e.g., time-series or gene profile) file format (*.cls)</a>         </li>
+
<p class="MsoNormal"><span style=""></span><a href="#_GMT_File_Format"><span style="">GMT: Gene Matrix Transposed file format (*.gmt)</span><span style=""></span></a><span style=""></span></p>
</ol>
+
<p class="MsoNormal"><span style=""></span><a href="#_GRP_File_Format"><span style="">GRP: Gene set file format (*.grp)</span><span style=""></span></a><span style=""></span></p>
<h2>Gene set database formats</h2>
+
<p class="MsoNormal"><span style=""></span><a href="#_MDB_File_Format"><span style="">MDB: Molecular signature database file format (*.mdb)</span><span style=""></span></a><span style=""></span></p>
<ol>
+
<p class="MsoNormal"><span style=""><strong style="">Note</strong>: Typically, you use the GMX or GMT formats to define gene sets.</span></p>
    <li><a href="../../../doc/data_formats.html#gmx">GMX: Gene MatriX file format (*.gmx)</a></li>
+
<p class="MsoNormal"><span style=""><strong style=""><span style="font-size: 14pt;">Microarray annotation formats<o:p></o:p></span></strong></span></p>
    <li><a href="../../../doc/data_formats.html#gmt">GMT: Gene Matrix Transposed file format (*.gmt)</a></li>
+
<p class="MsoNormal"><span style=""></span><a href="#_CHIP_File_Format"><span style="">CHIP: Chip file format (*.chip)</span><span style=""></span></a><span style=""> </span></p>
    <li><a href="../../../doc/data_formats.html#grp">GRP: Gene set file format (*.grp)</a></li>
+
<p class="MsoNormal"><span style=""></span><a href="#_CSV_File_Format_(for Chip Files)"><span style="">CSV: Comma Separated Version (*.csv)</span><span style=""></span></a><span style=""></span></p>
    <li><a href="../../../doc/data_formats.html#mdb">MDB: Molecular signature database file format (*.mdb)</a></li>
+
<p class="MsoNormal"><span style=""><strong style=""><span style="font-size: 14pt;">Ranked gene lists<o:p></o:p></span></strong></span></p>
</ol>
+
<p class="MsoNormal"><span style=""></span><a href="#_RNK_File_Format"><span style="">RNK: Ranked list file format (*.rnk)</span><span style=""></span></a><span style=""></span></p>
<h2>Microarray annotation formats</h2>
+
<h1><span style=""><a name="_GCT:_Gene_Cluster_Text file format "></a><a name="_GCT_File_Format"></a>GCT File Format</span></h1>
<ol>
+
<p class="MsoNormal"><span style="">The GCT format is a tab delimited file format that describes an expression dataset. It is organized as follows:</span></p>
    <li>             <a href="../../../doc/data_formats.html#chip">CHIP: Chip file format (*.chip)</a>         </li>
+
<p class="MsoNormal"><span style="">[[image: gct_format_snapshot.png]]</span></p>
    <li><a href="../../../doc/data_formats.html#map">MAP: Chip mapping file format (*.map)</a>         </li>
+
<p class="MsoNormal"><span style="">The <strong style="">first line</strong> contains the version string and is always the same for this file format. Therefore, the first line must be as follows: </span></p>
</ol>
+
<p class="MsoListContinue"><span style=""><code><span style="font-size: 10pt;">#1.2 <o:p></o:p></span></code></span></p>
<h2>Ranked gene lists</h2>
+
<p class="MsoNormal"><span style="">The <strong style="">second line</strong> contains numbers indicating the size of the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns. </span></p>
<ol>
+
<p style="" class="MsoNormal"><span style="">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>(</span><span style=""><code><span style="font-size: 10pt;"># of data rows) (tab) (# of data columns)<o:p></o:p></span></code></span></p>
    <li><a href="../../../doc/data_formats.html#rnk">RNK: Ranked list file format (*.rnk)</a></li>
+
<p style="" class="MsoNormal"><span style="">Example:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">7129 58 <o:p></o:p></span></code></span></p>
</ol>
+
<p class="MsoNormal"><span style="">The <strong style="">third line</strong> contains a list of identifiers for the samples associated with each of the columns in the remainder of the file. </span></p>
<p class="small">         Note: The GCT &amp; RES expression formats supported by GSEA are identical to those supported by GenePattern.        Some of the description is duplicated here; the GenePattern website has more documentation on file formats.    </p>
+
<p style="" class="MsoNormal"><span style="">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">Name(tab)Description(tab)(sample 1 name)(tab)(sample 2 name) (tab) ... (sample N name) <o:p></o:p></span></code></span></p>
<hr /> <br /> <br /> <a name="gct"><strong>GCT File Format</strong></a><br /> <br /> The GCT format is a tab delimited file format that is organized as follows<br /> <img src="../../../images/gct_format_snapshot.png" alt="GCT format snapshot" /> <br />
+
<p style="" class="MsoNormal"><span style="">Example:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span> </span><span style=""><code><span style="font-size: 10pt;">Name Description DLBC1_1 DLBC2_1 ... DLBC58_0 <o:p></o:p></span></code></span></p>
<ol>
+
<p class="MsoNormal"><span style="">The <strong style="">remainder</strong> of the data file contains data for each of the genes. There is one line for each gene and one column for each of the samples. The first two fields in the line contain name and descriptions for the genes (names and descriptions can contain spaces since fields are separated by tabs). The number of lines should agree with the number of data rows specified on line 2. </span></p>
    <li>The first line contains the version string and is always the same for this file        format. Therefore, the first line must be as follows:
+
<p style="" class="MsoNormal"><span style="">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">(gene name) (tab) (gene description) (tab) (col 1 data) (tab) (col 2 data) (tab) ... (col N data) <o:p></o:p></span></code></span></p>
    <ul>
+
<p style="" class="MsoNormal"><span style="">Example:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span> </span><span style=""><code><span style="font-size: 10pt;">AFFX-BioB-5_at AFFX-BioB-5_at (endogenous control) -104 -152 -158 ... -44 <o:p></o:p></span></code></span></p>
        <li><font class="computerfont">#1.2</font>         </li>
+
<h1><span style=""><a name="_RES_File_Format"></a>RES File Format</span></h1>
    </ul>
+
<p class="MsoNormal"><span style="">The RES file format is a tab delimited file format that describes an expression dataset. The main difference between the RES and GCT file formats is that the RES file format contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software. The RES file format is organized as follows:</span></p>
    </li>
+
<p class="MsoNormal"><span style="">[[image: res_format_snapshot.png]]</span></p>
    <li>The second line contains numbers indicating the size of the data table that        is contained in the remainder of the file. Note that the name and        description columns are not included in the number of data columns.
+
<p class="MsoNormal"><span style="">The <strong style="">first line</strong> contains a list of labels identifying the samples associated with each of the columns in the remainder of the file. Two tabs (\t\t) separate the sample identifier labels because each sample contains two data values (an expression value and a present/marginal/absent call). </span></p>
    <ul>
+
<p class="MsoNormal"><span style="">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">Description (tab) Accession (tab) (sample 1 name) (tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)</span></code> </span></p>
        <li>Line format: (# of data rows) (tab) (# of data columns)            </li>
+
<p class="MsoNormal"><span style="">For example:<span style="">&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0</span></code> </span></p>
        <li>For example: <font class="computerfont">7129 58</font>         </li>
+
<p class="MsoNormal"><span style="">The <strong style="">second line</strong> contains a list of sample descriptions. Currently, GSEA ignores these descriptions. Our RES file creation tool places the sample data file name and scale factors in this row, as shown below.</span></p>
    </ul>
+
<p class="MsoNormal"><span style="">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">(tab) (sample 1 description) (tab) (tab) (sample 2 description) (tab) (tab) ... (sample N description) </span></code></span></p>
    </li>
+
<p class="MsoNormal"><span style="">Example:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">MG2000062219AA MG2000062256AA/scale factor=1.2172 ... MG2000062211AA/scale factor=1.1214</span></code> </span></p>
    <li>The third line contains a list of identifiers for the samples associated with each        of the columns in the remainder of the file.
+
<p class="MsoNormal"><span style="">The <strong style="">third line</strong> contains a number indicating the number of rows in the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns. </span></p>
    <ul>
+
<p class="MsoNormal"><span style="">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">(# of data rows)</span></code> </span></p>
        <li>Line format: Name (tab) Description (tab) (sample 1 name) (tab)                (sample 2 name) (tab) ... (sample N name)            </li>
+
<p class="MsoNormal"><span style="">For example:<span style="">&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">7129</span></code> </span></p>
        <li>For example: <font class="computerfont">Name Description DLBC1_1 DLBC2_1 ... DLBC58_0</font>         </li>
+
<p class="MsoNormal"><span style="">The <strong style="">remainder</strong> of the data file contains data for each of the genes. There is one row for each gene and two columns for each of the samples. The first two fields in the row contain the description and name for each of the genes (names and descriptions can contain spaces since fields are separated by tabs). The description field is optional but the tab following it is not. Each sample has two pieces of data associated with it: an expression value and an associated Absent/Marginal/Present (A/M/P) call. The A/M/P calls are generated by microarray scanning software (such as Affymetrix's GeneChip software) and are an indication of the confidence in the measured expression value. Currently, GSEA ignores the Absent/Marginal/Present call. </span></p>
    </ul>
+
<p class="MsoNormal"><span style="">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">(gene description) (tab) (gene name) (tab) (sample 1 data) (tab) (sample 1 A/P call) (tab) (sample 2 data) (tab) (sample 2 A/P call) (tab) ... (sample N data) (tab) (sample N A/P call) </span></code></span></p>
    </li>
+
<p class="MsoNormal"><span style="">For example:<span style="">&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104 A -152 A ... -44 A</span></code> </span></p>
    <li>The remainder of the data file contains data for each of the genes. There        is one line for each gene and one column for each of the samples. The        first two fields in the line contain name and descriptions for the genes        (names and descriptions can contain spaces since fields are separated by        tabs). The number of lines should agree with the number of data rows        specified on line 2.
+
<h1><span style=""><a name="_PCL_File_Format"></a>PCL File Format</span></h1>
    <ul>
+
<p class="MsoNormal"><span style="">The PCL file format is a tab delimited file format that describes an expression dataset. Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. It is organized as follows:</span></p>
        <li>Line format: (gene name) (tab) (gene description) (tab) (col 1 data)            (tab) (col 2 data) (tab) ... (col N data)        </li>
+
<p class="MsoNormal"><span style="">[[image: pcl_format_snapshot.png]]</span></p>
        <li>For example: <font class="computerfont">AFFX-BioB-5_at AFFX-BioB-5_at (endogenous            control) -104 -152 -158 ... -44</font>     </li>
+
<h1><span style=""><a name="_CLS_File_Format:_Categorical"></a>CLS File Format: Categorical</span></h1>
    </ul>
+
<p class="MsoNormal"><span style="">The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. The CLS file format uses spaces or tabs to separate the fields.</span></p>
    </li>
+
<p class="MsoNormal"><span style="">The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes. Categorical labels define discrete phenotypes (for example, normal vs. tumor). For categorical labels, the CLS file format is organized as follows:</span></p>
</ol>
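<p>The GCT layout described above (version line, dimensions line, sample header, then one row per gene) can be sketched as a small reader. This is a minimal illustration, not GSEA's own loader: the function name <code>parse_gct</code> and the tiny in-memory sample are assumptions made for the example.</p>

```python
import io


def parse_gct(handle):
    """Read a GCT stream and return (sample_names, rows).

    Hypothetical helper for illustration; each row is a
    (name, description, values) tuple.
    """
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unexpected version line: {version!r}")

    # Line 2: (# of data rows) (tab) (# of data columns)
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])

    # Line 3: Name, Description, then one identifier per sample
    header = handle.readline().rstrip("\r\n").split("\t")
    samples = header[2:]
    if len(samples) != n_cols:
        raise ValueError("sample count does not match line 2")

    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        values = [float(v) for v in fields[2:]]
        if len(values) != n_cols:
            raise ValueError(f"wrong number of values for {fields[0]!r}")
        rows.append((fields[0], fields[1], values))

    # The number of data lines should agree with line 2
    if len(rows) != n_rows:
        raise ValueError("row count does not match line 2")
    return samples, rows


# Made-up two-gene, three-sample dataset
gct_text = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "AFFX-BioB-5_at\tAFFX-BioB-5_at (endogenous control)\t-104\t-152\t-158\n"
    "gene_b\tna\t1.5\t2\t3\n"
)
gct_samples, gct_rows = parse_gct(io.StringIO(gct_text))
```

<p>Splitting only on tabs keeps spaces inside gene names and descriptions intact, as the format requires.</p>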
+
<p class="MsoNormal"><span style="">[[image: cls_format_snapshot.png]]</span></p>
The main difference between the RES and GCT file formats is that the RES file format contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software.<br /> <hr /> <a name="res"><strong>RES File Format</strong></a><br /> This is a tab delimited file format that is organized as follows:<br /> <img src="../../../images/res_format_snapshot.png" alt="RES format snapshot" />
+
<p class="MsoNormal"><span style="">The <strong style="">first line</strong> of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file. </span></p>
<ol>
+
<p class="MsoNormal"><span style="">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">(number of samples) (space) (number of classes) (space) 1 </span></code></span></p>
    <li>The first line contains a list of labels identifying the samples associated with        each of the columns in the remainder of the file. Two tabs (\t\t) separate the        sample identifier labels because each sample contains two data values (an        expression value and a present/marginal/absent call).
+
<p class="MsoNormal"><span style="">Example:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">58 2 1 </span></code></span></p>
    <ul>
+
<p class="MsoNormal"><span style="">The <strong style="">second line</strong> in a CLS file contains names for the class numbers. The line should begin with a pound sign (#) followed by a space. </span></p>
        <li>Line format: Description (tab) Accession (tab) (sample 1 name)                (tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)            </li>
+
<p class="MsoNormal"><span style="">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;"># (space) (class 0 name) (space) (class 1 name)</span></code> </span></p>
        <li>For example: <font class="computerfont">Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0</font>         </li>
+
<p class="MsoNormal"><span style="">For example:<span style="">&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;"># cured fatal/ref</span></code> </span></p>
    </ul>
+
<p class="MsoNormal"><span style="">The <strong style="">third line</strong> contains numeric class labels for each of the samples. The number of class labels should be the same as the number of samples specified in the first line.</span></p>
    </li>
+
<p class="MsoNormal"><span style="">Line format:<span style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">(sample 1 class) (space) (sample 2 class) (space) ... (sample N class)</span></code></span></p>
    <li>The second line contains a list of sample descriptions. Currently,        GSEA ignores these descriptions.
+
<p class="MsoNormal"><span style="">For example:<span style="">&nbsp;&nbsp;&nbsp; </span></span><span style=""><code><span style="font-size: 10pt;">0 0 0 ... 1</span></code></span></p>
    <ul>
+
<h1><span style=""><a name="_CLS_File_Format:_Continuous"></a>CLS File Format: Continuous</span></h1>
        <li>Line format: (tab) (sample 1 description) (tab) (tab) (sample 2                description) (tab) (tab) ... (sample N description)            </li>
+
<p class="MsoNormal"><span style="">The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. The CLS file format uses spaces or tabs to separate the fields.</span></p>
        <li>For example, our RES file creation tool places the sample data file                name and scale factors in this row: <font class="computerfont">MG2000062219AA                MG2000062256AA/scale factor=1.2172 ...                MG2000062211AA/scale factor=1.1214</font>         </li>
+
<p class="MsoNormal"><span style="">The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes. Continuous phenotypes are used for time series experiments or to find gene set correlations with a gene of interest (gene neighbors). For continuous labels, the CLS file format is organized as follows:</span></p>
    </ul>
+
<p class="MsoNormal"><span style="">[[image: cls_numeric_format_snapshot.png]]</span></p>
    </li>
+
<p class="MsoNormal"><span style="">[[image: cls_time_series_format_snapshot.png]]</span></p>
    <li>The third line contains a number indicating the number of rows in the data table that        is contained in the remainder of the file. Note that the name and        description columns are not included in the number of data columns.
+
<h1><span style=""><a name="_GMX_File_Format"></a>GMX File Format</span></h1>
    <ul>
+
<p class="MsoNormal"><span style="">The GMX file format is a tab delimited file format that describes gene sets. In the GMX format, each column represents a gene set; in the GMT format, each row represents a gene set. The GMX file format is organized as follows:</span></p>
        <li>Line format: (# of data rows)            </li>
+
<p class="MsoNormal"><span style="">[[image: gmx_format_snapshot.png]]</span></p>
        <li>For example: <font class="computerfont">7129</font>         </li>
+
<p class="MsoNormal"><span style=""><a name="_GMT_File_Format"></a>Each gene set is described by a name, a description, and the genes in the gene set. GSEA uses the description field to determine what hyperlink to provide in the report for the gene set description: if the description is &ldquo;na&rdquo;, GSEA provides a link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.</span></p>
    </ul>
+
<h1><span style="">GMT File Format</span></h1>
    </li>
+
<p class="MsoNormal"><span style="">The GMT file format is a tab delimited file format that describes gene sets. In the GMT format, each row represents a gene set; in the GMX format, each column represents a gene set. The GMT file format is organized as follows:</span></p>
    <li>The rest of the data file contains data for each of the genes. There        is one row for each gene and two columns for each of the samples. The        first two fields in the row contain the description and name for each of the        genes (names and descriptions can contain spaces since fields are        separated by tabs). The description field is optional but the tab following it is not.        Each sample has two pieces of data associated with        it: an expression value and an associated Absent/Marginal/Present (A/M/P) call.        The A/M/P calls are generated by microarray scanning        software (such as Affymetrix's GeneChip software) and are an indication        of the confidence in the measured expression value. Currently,        GSEA ignores the Absent/Marginal/Present call.
+
<p class="MsoNormal"><span style="">[[image: gmt_format_snapshot.png]]</span></p>
    <ul>
+
<p class="MsoNormal"><span style=""><a name="_GRP_File_Format"></a>Each gene set is described by a name, a description, and the genes in the gene set. GSEA uses the description field to determine what hyperlink to provide in the report for the gene set description: if the description is &ldquo;na&rdquo;, GSEA provides a link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.</span></p>
        <li>Line format: (gene description) (tab) (gene name) (tab) (sample 1            data) (tab) (sample 1 A/P call) (tab) (sample 2 data) (tab) (sample 2 A/P call)            (tab) ... (sample N data) (tab) (sample N A/P call)        </li>
+
<h1><span style="">GRP File Format</span></h1>
        <li>For example: <font class="computerfont">AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104            A -152 A ... -44 A</font>     </li>
+
<p class="MsoNormal"><span style="">The GRP files contain a single gene set in a simple newline-delimited text format. Typically, you use the GMT or GMX file formats to create gene sets, rather than using the GRP file format. The GRP file format is organized as follows:</span></p>
    </ul>
+
<p class="MsoNormal"><span style="">[[image: grp_format_snapshot.png]]</span></p>
    </li>
+
<h1><span style=""><a name="_CHIP_File_Format"></a>CHIP File Format</span></h1>
</ol>
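<p>The RES layout described above (sample header with doubled tabs, a description line, a row count, then value/call pairs per sample) can be sketched as follows. This is an illustrative reader under stated assumptions, not GSEA's implementation: the name <code>parse_res</code> and the sample data are hypothetical.</p>

```python
import io


def parse_res(handle):
    """Read a RES stream and return (sample_names, rows).

    Hypothetical helper for illustration; each row is a
    (name, description, values, calls) tuple.
    """
    # Line 1: Description, Accession, then sample names separated by two tabs,
    # so sample names occupy every second field from index 2 onward.
    header = handle.readline().rstrip("\r\n").split("\t")
    samples = [f for f in header[2::2] if f]

    # Line 2: sample descriptions, which GSEA currently ignores
    handle.readline()

    # Line 3: number of data rows
    n_rows = int(handle.readline().strip())

    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        description, name = fields[0], fields[1]
        values = [float(v) for v in fields[2::2]]  # expression values
        calls = fields[3::2]                       # A/M/P calls
        if len(values) != len(samples) or len(calls) != len(samples):
            raise ValueError(f"wrong number of fields for {name!r}")
        rows.append((name, description, values, calls))

    if len(rows) != n_rows:
        raise ValueError("row count does not match line 3")
    return samples, rows


# Made-up one-gene, two-sample dataset
res_text = (
    "Description\tAccession\tS1\t\tS2\n"
    "\tfile1\t\tfile2/scale factor=1.2\n"
    "1\n"
    "AFFX-BioB-5_at (endogenous control)\tAFFX-BioB-5_at\t-104\tA\t-152\tA\n"
)
res_samples, res_rows = parse_res(io.StringIO(res_text))
```

<p>Because the description field is optional but its trailing tab is not, an empty first field still leaves the gene name at index 1.</p>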
+
<p class="MsoNormal"><span style="">The CHIP file contains annotation about a microarray. It should list the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results and may also be used to collapse each probe set in the expression dataset to a single gene vector. </span></p>
<hr /> <a name="pcl"><strong>PCL File Format: Expression datasets</strong></a><br /> Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. This is a tab delimited file format that is organized as follows:<br /> <center><img src="../../../images/pcl_format_snapshot.png" alt="pcl format snapshot" /></center> <hr /> <a name="cls"><strong>CLS File Format: Categorical</strong></a><br /> <br /> The CLS files are text files created to load class information into GSEA. These files use spaces or tabs to separate the fields.<br />  <center><img src="../../../images/cls_format_snapshot.png" alt="cls format snapshot" /></center>
+
<p class="MsoNormal"><span style="">[[image: chip_format_snapshot.png]]</span></p>
<ol>
+
<h1><span style=""><a name="_CSV_File_Format_(for Chip Files)"></a>CSV File Format (for Chip Files)</span></h1>
    <li>The first line of a CLS file contains numbers indicating the number of        samples and number of classes. The number of samples should        correspond to the number of samples in the associated RES or GCT data        file.
+
<p class="MsoNormal"><span style="">The CSV file format is identical to the CHIP file, except that the values in each row are separated by commas rather than by tabs. This file format is primarily used for Affymetrix chips.</span></p>
    <ul>
+
<h1><span style=""><a name="_RNK_File_Format"></a>RNK File Format</span></h1>
        <li>Line format: (number of samples) (space) (number of classes) (space) 1            </li>
+
<p class="MsoNormal"><span style="">The RNK file contains a single, rank-ordered gene list (<em>not</em> gene set) in a simple newline-delimited text format. It is used when you have a pre-ordered ranked list that you want to analyze with GSEA. For instance, you might have used your favorite t-test-like statistic to produce a rank-ordered gene list from your dataset, which you now want to test for enrichment. </span></p>
        <li>For example: <font class="computerfont">58 2 1</font>         </li>
+
<p class="MsoNormal"><span style="">[[image: rnk_format_snapshot.png]]</span></p>
    </ul>
+
<h1><span style=""><a name="_MDB_File_Format"></a>MDB File Format</span></h1>
    </li>
+
<p style="margin-bottom: 12pt;" class="MsoNormal"><span style="">The MDB files contain an entire gene set database. Unlike the gmt/gmx files, the MDB files are designed to contain rich annotation about a gene set. They are XML-formatted files based on the MSigDB Document Type Definition (DTD). Following is the MSigDB DTD and a sample MDB file based on that DTD.</span></p>
    <li>The second line in a CLS file contains names for the class numbers. The        line should begin with a pound sign (#) followed by a space.
+
<p class="MsoNormal"><span style=""><strong>MSigDB DTD:<o:p></o:p></strong></span></p>
    <ul>
+
<p class="MsoNormal"><span style="">[[image: msigdb_dtd_snapshot.png]]</span></p>
        <li>Line format: # (space) (class 0 name) (space) (class 1 name)        </li>
+
<p class="MsoNormal"><span style=""><o:p>&nbsp;</o:p></span></p>
        <li>For example: <font class="computerfont"># cured fatal/ref</font>     </li>
+
<p class="MsoNormal"><span style=""><strong>Example of an MSigDB xml formatted file:<o:p></o:p></strong></span></p>
    </ul>
+
<p class="MsoNormal"><span style="">[[image: msigdb_xml_snapshot.png]]</span></p>
    </li>
 
    <li>The third line contains numeric class labels for each of the samples. The        number of class labels should be the same as the number of samples        specified in the first line.</li>
 
    <ul>
 
        <li>Line format: (sample 1 class) (space) (sample 2 class) (space) ... (sample N class)</li>
 
        <li>For example: <font class="computerfont">0 0 0 ... 1</font></li>
 
    </ul>
 
</ol>
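<p>The three-line categorical CLS layout described above can be sketched as a short reader. This is an illustration only: the name <code>parse_cls</code> is hypothetical, and mapping label 0 to the first class name follows the "(class 0 name) (class 1 name)" line format rather than any documented GSEA behavior.</p>

```python
def parse_cls(text):
    """Read categorical CLS text and return (class_names, labels).

    Hypothetical helper for illustration; fields may be separated
    by spaces or tabs, so plain str.split() is used throughout.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3:
        raise ValueError("a categorical CLS file needs three lines")

    # Line 1: (number of samples) (number of classes) 1
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])

    # Line 2: pound sign, then one name per class
    name_fields = lines[1].split()
    if not name_fields or name_fields[0] != "#":
        raise ValueError("second line must begin with '# '")
    class_names = name_fields[1:]
    if len(class_names) != n_classes:
        raise ValueError("class count does not match line 1")

    # Line 3: one numeric class label per sample
    labels = [int(x) for x in lines[2].split()]
    if len(labels) != n_samples:
        raise ValueError("label count does not match line 1")
    return class_names, labels


# Made-up four-sample, two-class file
cls_text = "4 2 1\n# cured fatal/ref\n0 0 1 1\n"
cls_names, cls_labels = parse_cls(cls_text)

# Assumed mapping: label i refers to the i-th class name on line 2
cls_phenotypes = [cls_names[i] for i in cls_labels]
```

<p>The sample count returned here should also be compared with the number of samples in the associated RES or GCT file before running an analysis.</p>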
 
<hr /> <a name="cls2"><strong>CLS File Format: Continuous</strong></a><br /> <br /> CLS files can also be used to analyze continuous profiles such as those from a time series experiment or to find gene set correlations with a gene of interest (gene neighbors). <br /> <center><img src="../../../images/cls_numeric_format_snapshot.png" alt="cls numeric snapshot" /></center> <center><img src="../../../images/cls_time_series_format_snapshot.png" alt="cls time series snapshot" /></center> <hr /> <a name="gmx"><strong>Gene set database: GMX File Format</strong></a><br /> <br /> The GMX files contain gene sets in a simple tab-delimited text format.<br /> <center><img src="../../../images/gmx_format_snapshot.png" alt="gmx format snapshot" /></center> <hr /> <a name="gmt"><strong>Gene set database: GMT File Format</strong></a><br /> <br /> The GMT files contain gene sets in a simple tab-delimited text format.<br /> <center><img src="../../../images/gmt_format_snapshot.png" alt="gmt format snapshot" /></center> <hr />  <a name="grp"><strong>GRP File Format</strong></a><br /> <br /> The GRP files contain a SINGLE gene set in a simple newline-delimited text format.<br /> <center><img src="../../../images/grp_format_snapshot.png" alt="grp format snapshot" /></center> <hr /> <a name="chip"><strong>CHIP File Format</strong></a><br /> <br /> The CHIP file contains annotation about a microarray. It should list the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results. <br /> <center><img src="../../../images/rnk_format_snapshot.png" alt="grp format snapshot" /></center> <hr /> <a name="map"><strong>MAP File Format</strong></a><br /> <br /> The MAP file contains annotations that map probe sets between microarrays. This file is not used directly in the GSEA algorithm, but is used to generate gene sets (via chip2chip). 
<br /> <center><img src="../../../images/rnk_format_snapshot.png" alt="grp format snapshot" /></center> <hr />  <a name="rnk"><strong>RNK File Format</strong></a><br /> <br /> The RNK file contains a single, rank-ordered gene list (<em>not</em> gene set) in a simple newline-delimited text format. It is used when you have a pre-ordered ranked list that you want to analyze with GSEA. For instance, you might have used your favorite t-test-like statistic to produce a rank-ordered gene list from your dataset, which you now want to test for enrichment (note that only gene tag permutations are possible with RNK datasets). <br /> <center><img src="../../../images/rnk_format_snapshot.png" alt="rnk format snapshot" /></center> <hr /> <a name="mdb"><strong>MDB File Format</strong></a><br /> <br /> The MDB files contain an entire gene set database. Unlike the GMT/GMX files, the MDB files are designed to contain rich annotation about a gene set. They are XML formatted. Consult the <a href="../../../doc/msigdb.dtd.txt">MSigDB Document Type Definition</a> for details about the format.<br /> <br /> <center><img src="../../../images/msigdb_dtd_snapshot.png" alt="msigdb DTD format snapshot" /></center>  <br /><strong>Example of an MSigDB XML-formatted file</strong><br /> <center><img src="../../../images/msigdb_xml_snapshot.png" alt="msigdb xml format snapshot" /></center>
 

Revision as of 12:09, 28 March 2006

<a name="_Toc127331825">Data Formats Supported by GSEA</a>

For sample files, see http://wwwdev.broad.mit.edu/gsea/resources/datasets_index.html or, from within GSEA, select Help>Show GSEA Home Folder and go to the examples subfolder.

Expression data formats

<a href="#_GCT_File_Format">GCT: Gene Cluster Text file format (*.gct)</a>

<a href="#_RES_File_Format">RES: ExpRESsion (with P and A calls) file format (*.res)</a>

<a href="#_PCL_File_Format">PCL: Stanford cDNA file format (*.pcl)</a>

Note: The GCT & RES expression formats supported by GSEA are identical to those supported by GenePattern.

Phenotype data formats

<a href="#_CLS_File_Format:_Categorical">CLS: Categorical (e.g., tumor vs normal) class file format (*.cls)</a>

<a href="#_CLS_File_Format:_Continuous">CLS: Continuous (e.g., time-series or gene profile) file format (*.cls)</a>

Gene set database formats

<a href="#_GMX_File_Format">GMX: Gene MatriX file format (*.gmx)</a>

<a href="#_GMT_File_Format">GMT: Gene Matrix Transposed file format (*.gmt)</a>

<a href="#_GRP_File_Format">GRP: Gene set file format (*.grp)</a>

<a href="#_MDB_File_Format">MDB: Molecular signature database file format (*.mdb)</a>

Note: Typically, you use the GMX or GMT formats to define gene sets.

Microarray annotation formats

<a href="#_CHIP_File_Format">CHIP: Chip file format (*.chip)</a>

<a href="#_CSV_File_Format_(for Chip Files)">CSV: Comma Separated Version (*.csv)</a>

Ranked gene lists

<a href="#_RNK_File_Format">RNK: Ranked list file format (*.rnk)</a>

<a name="_GCT:_Gene_Cluster_Text file format "></a><a name="_GCT_File_Format"></a>GCT File Format

The GCT format is a tab-delimited file format that describes an expression dataset. It is organized as follows:

File:Gct format snapshot.png

The first line contains the version string and is always the same for this file format. Therefore, the first line must be as follows:

#1.2

The second line contains numbers indicating the size of the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.

Line format:        (# of data rows) (tab) (# of data columns)

Example:            7129 58

The third line contains a list of identifiers for the samples associated with each of the columns in the remainder of the file.

Line format:        Name (tab) Description (tab) (sample 1 name) (tab) (sample 2 name) (tab) ... (sample N name)

Example:            Name Description DLBC1_1 DLBC2_1 ... DLBC58_0

The remainder of the data file contains data for each of the genes. There is one line for each gene and one column for each of the samples. The first two fields in the line contain the name and description of the gene (names and descriptions can contain spaces since fields are separated by tabs). The number of lines should agree with the number of data rows specified on line 2.

Line format:        (gene name) (tab) (gene description) (tab) (col 1 data) (tab) (col 2 data) (tab) ... (col N data)

Example:            AFFX-BioB-5_at AFFX-BioB-5_at (endogenous control) -104 -152 -158 ... -44
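
The GCT layout described above can be checked with a short script. The following is a minimal sketch in Python, not part of GSEA; the function name read_gct and the sample data are hypothetical and chosen only for illustration. It assumes numeric expression values and verifies the version string and the row and column counts from line 2.

```python
def read_gct(text):
    """Parse GCT text into (sample_names, rows).

    Each row is a tuple (name, description, values). Hypothetical helper,
    written only to illustrate the layout described in this section.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Line 1: the version string, always "#1.2".
    if lines[0].strip() != "#1.2":
        raise ValueError("first line must be the version string #1.2")
    # Line 2: number of data rows and number of data columns.
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])
    # Line 3: Name, Description, then one identifier per sample.
    header = lines[2].split("\t")
    samples = header[2:2 + n_cols]
    if len(samples) != n_cols:
        raise ValueError("sample count does not match line 2")
    rows = []
    for line in lines[3:]:
        fields = line.split("\t")
        values = [float(v) for v in fields[2:2 + n_cols]]
        rows.append((fields[0], fields[1], values))
    if len(rows) != n_rows:
        raise ValueError("row count does not match line 2")
    return samples, rows


example = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "AFFX-BioB-5_at\tAFFX-BioB-5_at (endogenous control)\t-104\t-152\t-158\n"
    "GENE2\tna\t1.5\t2\t3\n"
)
samples, rows = read_gct(example)
print(samples)
print(rows[0])
```

Splitting on tabs only (rather than on any whitespace) keeps names and descriptions that contain spaces intact.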

<a name="_RES_File_Format"></a>RES File Format

The RES file format is a tab-delimited file format that describes an expression dataset. The main difference between the RES and GCT file formats is that the RES file format contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software. The RES file format is organized as follows:

File:Res format snapshot.png

The first line contains a list of labels identifying the samples associated with each of the columns in the remainder of the file. Two tabs (\t\t) separate the sample identifier labels because each sample contains two data values (an expression value and a present/marginal/absent call).

Line format:      Description (tab) Accession (tab) (sample 1 name) (tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)

For example:    Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0

The second line contains a list of sample descriptions. Currently, GSEA ignores these descriptions. Our RES file creation tool places the sample data file name and scale factors in this row, as shown below.

Line format:      (tab) (sample 1 description) (tab) (tab) (sample 2 description) (tab) (tab) ... (sample N description)

Example:          MG2000062219AA MG2000062256AA/scale factor=1.2172 ... MG2000062211AA/scale factor=1.1214

The third line contains a number indicating the number of rows in the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.

Line format:      (# of data rows)

For example:    7129

The remainder of the data file contains data for each of the genes. There is one row for each gene and two columns for each of the samples. The first two fields in the row contain the description and name for each of the genes (names and descriptions can contain spaces since fields are separated by tabs). The description field is optional but the tab following it is not. Each sample has two pieces of data associated with it: an expression value and an associated Absent/Marginal/Present (A/M/P) call. The A/M/P calls are generated by microarray scanning software (such as Affymetrix's GeneChip software) and are an indication of the confidence in the measured expression value. Currently, GSEA ignores the Absent/Marginal/Present call.

Line format:      (gene description) (tab) (gene name) (tab) (sample 1 data) (tab) (sample 1 A/P call) (tab) (sample 2 data) (tab) (sample 2 A/P call) (tab) ... (sample N data) (tab) (sample N A/P call)

For example:    AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104 A -152 A ... -44 A
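
To make the paired value/call columns concrete, here is a minimal Python sketch of a RES reader. It is an illustration under the layout described above, not GSEA's own parser; the function name read_res and the sample data are hypothetical. Like GSEA, it skips the sample descriptions on line 2, but it keeps the A/M/P calls so they can be inspected.

```python
def read_res(text):
    """Parse RES text into (sample_names, records).

    Each record is a dict with keys: description, name, values, calls.
    Hypothetical helper for illustration only.
    """
    lines = text.splitlines()
    # Line 1: Description, Accession, then sample names separated by two tabs,
    # which leaves an empty field after each sample name.
    header = lines[0].split("\t")
    samples = [h for h in header[2:] if h.strip()]
    n = len(samples)
    # Line 2 holds sample descriptions and is skipped here.
    # Line 3: number of data rows.
    n_rows = int(lines[2].strip())
    records = []
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        records.append({
            "description": fields[0],
            "name": fields[1],
            # Expression values sit at even offsets, calls at odd offsets.
            "values": [float(v) for v in fields[2:2 + 2 * n:2]],
            "calls": fields[3:3 + 2 * n:2],
        })
    if len(records) != n_rows:
        raise ValueError("row count does not match line 3")
    return samples, records


example = (
    "Description\tAccession\tS1\t\tS2\t\n"
    "\tdesc1\t\tdesc2\n"
    "2\n"
    "AFFX-BioB-5_at (endogenous control)\tAFFX-BioB-5_at\t-104\tA\t-152\tA\n"
    "gene two\tG2\t10.5\tP\t7\tM\n"
)
samples, records = read_res(example)
print(samples)
print(records[0]["values"], records[0]["calls"])
```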

<a name="_PCL_File_Format"></a>PCL File Format

The PCL file format is a tab-delimited file format that describes an expression dataset. Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. The PCL file format is organized as follows:

File:Pcl format snapshot.png

<a name="_CLS_File_Format:_Categorical"></a>CLS File Format: Categorical

The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. The CLS file format uses spaces or tabs to separate the fields.

The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes. Categorical labels define discrete phenotypes (for example, normal vs tumor). For categorical labels, the CLS file format is organized as follows:

File:Cls format snapshot.png

The first line of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file.

Line format:      (number of samples) (space) (number of classes) (space) 1

Example:          58 2 1

The second line in a CLS file contains names for the class numbers. The line should begin with a pound sign (#) followed by a space.

Line format:      # (space) (class 0 name) (space) (class 1 name)

For example:    # cured fatal/ref

The third line contains numeric class labels for each of the samples. The number of class labels should be the same as the number of samples specified in the first line.

Line format:      (sample 1 class) (space) (sample 2 class) (space) ... (sample N class)

For example:    0 0 0 ... 1
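
The three-line categorical layout above can be read with a few lines of Python. This is a minimal sketch, assuming numeric class labels as in the example; the function name parse_cls and the sample data are hypothetical. It checks the counts from the first line against the other two lines and maps each sample to its class name.

```python
def parse_cls(text):
    """Parse a categorical CLS file into (class_names, per_sample_classes).

    Hypothetical helper for illustration only; assumes numeric class labels.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Line 1: (number of samples) (number of classes) 1
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    # Line 2: a pound sign followed by the class names.
    if not lines[1].startswith("#"):
        raise ValueError("second line must begin with a pound sign (#)")
    class_names = lines[1][1:].split()
    if len(class_names) != n_classes:
        raise ValueError("class name count does not match line 1")
    # Line 3: one numeric class label per sample.
    labels = [int(x) for x in lines[2].split()]
    if len(labels) != n_samples:
        raise ValueError("label count does not match line 1")
    return class_names, [class_names[i] for i in labels]


example = "6 2 1\n# cured fatal/ref\n0 0 0 1 1 1\n"
names, per_sample = parse_cls(example)
print(names)
print(per_sample)
```

Because the fields are split on any whitespace, the sketch accepts both spaces and tabs as separators.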

<a name="_CLS_File_Format:_Continuous"></a>CLS File Format: Continuous

The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. The CLS file format uses spaces or tabs to separate the fields.

The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes. Continuous phenotypes are used for time-series experiments or to find gene sets that correlate with a gene of interest (gene neighbors). For continuous labels, the CLS file format is organized as follows:

File:Cls numeric format snapshot.png

File:Cls time series format snapshot.png

<a name="_GMX_File_Format"></a>GMX File Format

The GMX file format is a tab-delimited file format that describes gene sets. In the GMX format, each column represents a gene set; in the GMT format, each row represents a gene set. The GMX file format is organized as follows:

File:Gmx format snapshot.png
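
As an illustration of the column-per-gene-set layout, the following Python sketch reads GMX text into a dictionary. It is not GSEA code; the function name read_gmx and the sample data are hypothetical. It assumes that the first row holds the gene set names, the second row the descriptions, and the remaining rows the genes, with shorter columns padded by empty cells.

```python
def read_gmx(text):
    """Parse GMX text into {name: (description, [genes])}.

    Hypothetical helper for illustration only. Assumes row 1 = names,
    row 2 = descriptions, remaining rows = genes.
    """
    rows = [ln.split("\t") for ln in text.splitlines() if ln.strip()]
    # Pad short rows so that every row has the same number of columns.
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]
    gene_sets = {}
    # zip(*rows) transposes the table: each tuple is one column (gene set).
    for column in zip(*rows):
        name, description, *genes = (cell.strip() for cell in column)
        if name:
            gene_sets[name] = (description, [g for g in genes if g])
    return gene_sets


example = (
    "SET_A\tSET_B\n"
    "na\thttp://example.org/b\n"
    "TP53\tBRCA1\n"
    "MYC\t\n"
    "EGFR\t\n"
)
print(read_gmx(example))
```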

Each gene set is described by a name, a description, and the genes in the gene set. GSEA uses the description field to determine what hyperlink to provide in the report for the gene set description: if the description is “na”, GSEA provides a link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.

<a name="_GMT_File_Format"></a>GMT File Format

The GMT file format is a tab-delimited file format that describes gene sets. In the GMT format, each row represents a gene set; in the GMX format, each column represents a gene set. The GMT file format is organized as follows:

File:Gmt format snapshot.png
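
The row-per-gene-set layout can be read line by line. The following is a minimal Python sketch, not GSEA code; the function name read_gmt and the sample data are hypothetical. It assumes each row holds the gene set name, then the description, then the genes, all separated by tabs.

```python
def read_gmt(text):
    """Parse GMT text into {name: (description, [genes])}.

    Hypothetical helper for illustration only. Assumes each row is
    name (tab) description (tab) gene 1 (tab) gene 2 ...
    """
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Raises ValueError if a row has fewer than two fields.
        name, description, *genes = line.split("\t")
        gene_sets[name] = (description, [g for g in genes if g.strip()])
    return gene_sets


example = (
    "SET_A\tna\tTP53\tMYC\tEGFR\n"
    "SET_B\thttp://example.org/b\tBRCA1\n"
)
for name, (description, genes) in read_gmt(example).items():
    print(name, description, len(genes))
```

Unlike GMX, rows of different lengths need no padding here, which is one reason a row-based layout is convenient when gene sets differ widely in size.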

Each gene set is described by a name, a description, and the genes in the gene set. GSEA uses the description field to determine what hyperlink to provide in the report for the gene set description: if the description is “na”, GSEA provides a link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.

<a name="_GRP_File_Format"></a>GRP File Format

The GRP files contain a single gene set in a simple newline-delimited text format. Typically, you use the GMT or GMX file formats to create gene sets, rather than using the GRP file format. The GRP file format is organized as follows:

File:Grp format snapshot.png

<a name="_CHIP_File_Format"></a>CHIP File Format

The CHIP file contains annotation about a microarray. It should list the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results and may also be used to collapse each probe set in the expression dataset to a single gene vector.

File:Chip format snapshot.png

<a name="_CSV_File_Format_(for Chip Files)"></a>CSV File Format (for Chip Files)

The CSV file format is identical to the CHIP file, except that the values in each row are separated by commas rather than by tabs. This file format is primarily used for Affymetrix chips.

<a name="_RNK_File_Format"></a>RNK File Format

The RNK file contains a single, rank-ordered gene list (not gene set) in a simple newline-delimited text format. It is used when you have a pre-ordered ranked list that you want to analyze with GSEA. For instance, you might have used your favorite t-test-like statistic to produce a rank-ordered gene list from your dataset, which you now want to test for enrichment.

File:Rnk format snapshot.png
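
As a rough illustration, the Python sketch below turns per-gene scores into a ranked list with one gene per line. The function name to_rnk and the scores are hypothetical, and the sketch assumes that each line holds a gene identifier followed by a tab and its score, sorted from the highest score to the lowest; check the snapshot above against your own files before relying on this layout.

```python
def to_rnk(scores):
    """Return RNK-style text from a {gene: score} mapping.

    Hypothetical helper for illustration only. Assumes one gene per line,
    written as gene (tab) score, ordered from highest to lowest score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{gene}\t{score}\n" for gene, score in ranked)


# Made-up scores, e.g. from a t-test-like statistic computed elsewhere.
example_scores = {"GENE_A": 1.0, "GENE_B": 3.5, "GENE_C": -2.0}
print(to_rnk(example_scores), end="")
```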

<a name="_MDB_File_Format"></a>MDB File Format

The MDB files contain an entire gene set database. Unlike the GMT/GMX files, the MDB files are designed to contain rich annotation about a gene set. They are XML-formatted files based on the MSigDB Document Type Definition (DTD). Following is the MSigDB DTD and a sample MDB file based on that DTD.

MSigDB DTD:

File:Msigdb dtd snapshot.png

Example of an MSigDB XML-formatted file:

File:Msigdb xml snapshot.png