Difference between revisions of "Data formats"

From GeneSetEnrichmentAnalysisWiki
Line 1: Line 1:
 +
<h1 class="news">Data formats supported by GSEA</h1>
 +
<img alt="GSEA input file formats snapshot" src="../../../images/input_file_formats.png" />
 +
<p align="right"><a href="../../../resources/datasets_index.html">    <img border="0" src="../../../images/examples.jpg" alt="" /><img border="0" src="../../../images/arrow.png" alt="" /> </a> </p>
 
<h2>Expression data formats</h2>
 
<h2>Expression data formats</h2>
 
<ol>
 
<ol>
Line 27: Line 30:
 
</ol>
 
</ol>
 
<p class="small">        Note: The GCT &amp; RES expression formats supported by GSEA are identical to those supported by GenePattern.        Some description is duplicated here - the GenePattern website has more documentation on file formats.    </p>
 
<p class="small">        Note: The GCT &amp; RES expression formats supported by GSEA are identical to those supported by GenePattern.        Some description is duplicated here - the GenePattern website has more documentation on file formats.    </p>
<hr />  <br /> <br /> <a name="gct"><strong>GCT File Format</strong></a><br /> <br /> The GCT format is a tab delimited file format that is organized as follows<br /> <img src="../../../images/gct_format_snapshot.png" alt="GCT format snapshot" />
+
<hr />  <br /> <br /> <a name="gct"><strong>GCT File Format</strong></a><br /> <br /> The GCT format is a tab delimited file format that is organized as follows<br /> <img alt="GCT format snapshot" src="../../../images/gct_format_snapshot.png" /> <br />
 +
<ol>
 +
    <li>The first line contains the version string and is always the same for this file        format. Therefore, the first line must be as follows:
 +
    <ul>
 +
        <li><font class="computerfont">#1.2</font>        </li>
 +
    </ul>
 +
    </li>
 +
    <li>The second line contains numbers indicating the size of the data table that        is contained in the remainder of the file. Note that the name and        description columns are not included in the number of data columns.
 +
    <ul>
 +
        <li>Line format: (# of data rows) (tab) (# of data columns)            </li>
 +
        <li>For example: <font class="computerfont">7129 58</font>          </li>
 +
    </ul>
 +
    </li>
 +
    <li>The third line contains a list of identifiers for the samples associated with each        of the columns in the remainder of the file.
 +
    <ul>
 +
        <li>Line format: Name (tab) Description (tab) (sample 1 name) (tab)                (sample 2 name) (tab) ... (sample N name)            </li>
 +
        <li>For example: <font class="computerfont">Name Description DLBC1_1 DLBC2_1 ... DLBC58_0</font>        </li>
 +
    </ul>
 +
    </li>
 +
    <li>The remainder of the data file contains data for each of the genes. There        is one line for each gene and one column for each of the samples. The        first two fields in the line contain name and descriptions for the genes        (names and descriptions can contain spaces since fields are separated by        tabs). The number of lines should agree with the number of data rows        specified on line 2.
 +
    <ul>
 +
        <li>Line format: (gene name) (tab) (gene description) (tab) (col 1 data)            (tab) (col 2 data) (tab) ... (col N data)        </li>
 +
        <li>For example: <font class="computerfont">AFFX-BioB-5_at AFFX-BioB-5_at (endogenous            control) -104 -152 -158 ... -44</font>      </li>
 +
    </ul>
 +
    </li>
 +
</ol>
 +
The main difference between the RES and GCT file formats is that the RES format contains labels for each gene's absent (A) versus present (P) calls, as generated by Affymetrix's GeneChip software.<br />  <hr />  <a name="res"><strong>RES File Format</strong></a><br /> This is a tab delimited file format that is organized as follows:<br />  <img alt="RES format snapshot" src="../../../images/res_format_snapshot.png" />
 +
<ol>
 +
    <li>The first line contains a list of labels identifying the samples associated with        each of the columns in the remainder of the file. Two tabs (\t\t) separate the        sample identifier labels because each sample contains two data values (an        expression value and a present/marginal/absent call).
 +
    <ul>
 +
        <li>Line format: Description (tab) Accession (tab) (sample 1 name)                (tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)            </li>
 +
        <li>For example: <font class="computerfont">Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0</font>          </li>
 +
    </ul>
 +
    </li>
 +
    <li>The second line contains a list of sample descriptions. Currently,        GSEA ignores these descriptions.
 +
    <ul>
 +
        <li>Line format: (tab) (sample 1 description) (tab) (tab) (sample 2                description) (tab) (tab) ... (sample N description)            </li>
 +
        <li>For example, our RES file creation tool places the sample data file                name and scale factors in this row: <font class="computerfont">MG2000062219AA                MG2000062256AA/scale factor=1.2172 ...                MG2000062211AA/scale factor=1.1214</font>        </li>
 +
    </ul>
 +
    </li>
 +
    <li>The third line contains a number indicating the number of rows in the data table that        is contained in the remainder of the file. Note that the name and        description columns are not included in the number of data columns.
 +
    <ul>
 +
        <li>Line format: (# of data rows)            </li>
 +
        <li>For example: <font class="computerfont">7129</font>          </li>
 +
    </ul>
 +
    </li>
 +
    <li>The rest of the data file contains data for each of the genes. There        is one row for each gene and two columns for each of the samples. The        first two fields in the row contain the description and name for each of the        genes (names and descriptions can contain spaces since fields are        separated by tabs). The description field is optional but the tab following it is not.        Each sample has two pieces of data associated with        it: an expression value and an associated Absent/Marginal/Present (A/M/P) call.        The A/M/P calls are generated by microarray scanning        software (such as Affymetrix's GeneChip software) and are an indication        of the confidence in the measured expression value. Currently,        GSEA ignores the Absent/Marginal/Present call.
 +
    <ul>
 +
        <li>Line format: (gene description) (tab) (gene name) (tab) (sample 1 data) (tab) (sample 1 A/M/P call) (tab) (sample 2 data) (tab) (sample 2 A/M/P call) (tab) ... (sample N data) (tab) (sample N A/M/P call)        </li>
 +
        <li>For example: <font class="computerfont">AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104            A -152 A ... -44 A</font>    </li>
 +
    </ul>
 +
    </li>
 +
</ol>
 +
<hr />  <a name="pcl"><strong>PCL File Format: Expression datasets</strong></a><br /> Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. This is a tab delimited file format that is organized as follows:<br />  <center><img alt="pcl format snapshot" src="../../../images/pcl_format_snapshot.png" /></center> <hr /> <a name="cls"><strong>CLS File Format: Categorical</strong></a><br /> <br /> The CLS files are text files created to load class information into GSEA. These files use spaces or tabs to separate the fields.<br />  <center><img alt="cls format snapshot" src="../../../images/cls_format_snapshot.png" /></center>
 +
<ol>
 +
    <li>The first line of a CLS file contains numbers indicating the number of        samples and number of classes. The number of samples should        correspond to the number of samples in the associated RES or GCT data        file.
 +
    <ul>
 +
        <li>Line format: (number of samples) (space) (number of classes) (space) 1            </li>
 +
        <li>For example: <font class="computerfont">58 2 1</font>          </li>
 +
    </ul>
 +
    </li>
 +
    <li>The second line in a CLS file contains names for the class numbers. The        line should begin with a pound sign (#) followed by a space.
 +
    <ul>
 +
        <li>Line format: # (space) (class 0 name) (space) (class 1 name)        </li>
 +
        <li>For example: <font class="computerfont"># cured fatal/ref</font>    </li>
 +
    </ul>
 +
    </li>
 +
    <li>The third line contains numeric class labels for each of the samples. The        number of class labels should be the same as the number of samples        specified in the first line.</li>
 +
    <ul>
 +
        <li>Line format: (sample 1 class) (space) (sample 2 class) (space) ... (sample N class)</li>
 +
        <li>For example: <font class="computerfont">0 0 0 ... 1</font></li>
 +
    </ul>
 +
</ol>
 +
<hr /> <a name="cls2"><strong>CLS File Format: Continuous</strong></a><br /> <br /> CLS files can also be used to analyze continuous profiles, such as those from a time series experiment, or to find gene sets that correlate with a gene of interest (gene neighbors).<br /> <center><img alt="cls numeric snapshot" src="../../../images/cls_numeric_format_snapshot.png" /></center> <center><img alt="cls numeric snapshot" src="../../../images/cls_time_series_format_snapshot.png" /></center> <hr />  <a name="gmx"><strong>Gene set database: GMX File Format</strong></a><br /> <br /> The GMX files contain gene sets in a simple tab-delimited text format.<br /> <center><img alt="gmx format snapshot" src="../../../images/gmx_format_snapshot.png" /></center>  <hr /> <a name="gmt"><strong>Gene set database: GMT File Format</strong></a><br /> <br /> The GMT files contain gene sets in a simple tab-delimited text format.<br /> <center><img alt="gmt format snapshot" src="../../../images/gmt_format_snapshot.png" /></center> <hr />  <a name="grp"><strong>GRP File Format</strong></a><br /> <br /> The GRP files contain a SINGLE gene set in a simple newline-delimited text format.<br /> <center><img alt="grp format snapshot" src="../../../images/grp_format_snapshot.png" /></center> <hr />  <a name="chip"><strong>CHIP File Format</strong></a><br /> <br /> The CHIP file contains annotations for a microarray. It should list the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results. <br /> <center><img alt="grp format snapshot" src="../../../images/rnk_format_snapshot.png" /></center> <hr />  <a name="map"><strong>MAP File Format</strong></a><br /> <br /> The MAP file contains annotations that map probe sets between microarrays. This file is not used directly in the GSEA algorithm, but is used to generate gene sets (via chip2chip). 
<br /> <center><img alt="grp format snapshot" src="../../../images/rnk_format_snapshot.png" /></center> <hr />  <a name="rnk"><strong>RNK File Format</strong></a><br /> <br /> The RNK file contains a single, rank-ordered gene list (<em>not</em> a gene set) in a simple newline-delimited text format. Use it when you have a pre-ranked list that you want to analyze with GSEA. For instance, you might have used your favorite t-test-like statistic to produce a rank-ordered gene list from your dataset, which you now want to test for enrichment (note that only gene tag permutations are possible with RNK datasets). <br /> <center><img alt="rnk format snapshot" src="../../../images/rnk_format_snapshot.png" /></center> <hr /> <a name="mdb"><strong>MDB File Format</strong></a><br /> <br /> The MDB files contain an entire gene set database. Unlike the GMT/GMX files, the MDB files are designed to contain rich annotation about a gene set. They are XML formatted. Consult the <a href="../../../doc/msigdb.dtd.txt">MSigDB Document Type Definition</a> for details about the format.<br /> <br /> <center><img alt="msigdb DTD format snapshot" src="../../../images/msigdb_dtd_snapshot.png" /></center>  <br /><strong>Example of an MSigDB XML-formatted file</strong><br /> <center><img alt="msigdb xml format snapshot" src="../../../images/msigdb_xml_snapshot.png" /></center>

Revision as of 11:48, 24 March 2006

Data formats supported by GSEA

<img alt="GSEA input file formats snapshot" src="../../../images/input_file_formats.png" />

<a href="../../../resources/datasets_index.html"> <img border="0" src="../../../images/examples.jpg" alt="" /><img border="0" src="../../../images/arrow.png" alt="" /> </a>

Expression data formats

  1. <a href="../../../doc/data_formats.html#gct">GCT: Gene Cluster Text file format (*.gct)</a>
  2. <a href="../../../doc/data_formats.html#res">RES: ExpRESsion (with P and A calls) file format (*.res)</a>
  3. <a href="../../../doc/data_formats.html#pcl">PCL: Stanford cDNA file format (*.pcl)</a>

Phenotype data formats

  1. <a href="../../../doc/data_formats.html#cls">CLS: Categorical (e.g., tumor vs. normal) class file format (*.cls)</a>
  2. <a href="../../../doc/data_formats.html#cls2">CLS: Continuous (e.g., time-series or gene profile) file format (*.cls)</a>

Gene set database formats

  1. <a href="../../../doc/data_formats.html#gmx">GMX: Gene MatriX file format (*.gmx)</a>
  2. <a href="../../../doc/data_formats.html#gmt">GMT: Gene Matrix Transposed file format (*.gmt)</a>
  3. <a href="../../../doc/data_formats.html#grp">GRP: Gene set file format (*.grp)</a>
  4. <a href="../../../doc/data_formats.html#mdb">MDB: Molecular signature database file format (*.mdb)</a>

Microarray annotation formats

  1. <a href="../../../doc/data_formats.html#chip">CHIP: Chip file format (*.chip)</a>
  2. <a href="../../../doc/data_formats.html#map">MAP: Chip mapping file format (*.map)</a>

Ranked gene lists

  1. <a href="../../../doc/data_formats.html#rnk">RNK: Ranked list file format (*.rnk)</a>

Note: The GCT & RES expression formats supported by GSEA are identical to those supported by GenePattern. Some description is duplicated here - the GenePattern website has more documentation on file formats.




<a name="gct">GCT File Format</a>

The GCT format is a tab-delimited file format that is organized as follows:
<img alt="GCT format snapshot" src="../../../images/gct_format_snapshot.png" />

  1. The first line contains the version string and is always the same for this file format. Therefore, the first line must be as follows:
    • #1.2
  2. The second line contains numbers indicating the size of the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.
    • Line format: (# of data rows) (tab) (# of data columns)
    • For example: 7129 58
  3. The third line contains a list of identifiers for the samples associated with each of the columns in the remainder of the file.
    • Line format: Name (tab) Description (tab) (sample 1 name) (tab) (sample 2 name) (tab) ... (sample N name)
    • For example: Name Description DLBC1_1 DLBC2_1 ... DLBC58_0
  4. The remainder of the data file contains data for each of the genes. There is one line for each gene and one column for each of the samples. The first two fields in the line contain name and descriptions for the genes (names and descriptions can contain spaces since fields are separated by tabs). The number of lines should agree with the number of data rows specified on line 2.
    • Line format: (gene name) (tab) (gene description) (tab) (col 1 data) (tab) (col 2 data) (tab) ... (col N data)
    • For example: AFFX-BioB-5_at AFFX-BioB-5_at (endogenous control) -104 -152 -158 ... -44
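
The four-part layout above can be sketched as a small reader. This is only an illustration under stated assumptions, not GSEA's own parser: the function name `parse_gct` and the returned structure are hypothetical, and all data values are assumed to be numeric.

```python
def parse_gct(text):
    """Parse GCT-formatted text into (sample_names, rows).

    Hypothetical helper for illustration; each row is returned as
    (gene_name, gene_description, [values]).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]

    # Line 1: the version string is always "#1.2".
    if lines[0].strip() != "#1.2":
        raise ValueError("first line must be '#1.2'")

    # Line 2: (# of data rows) (tab) (# of data columns).
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])

    # Line 3: Name, Description, then one identifier per sample.
    samples = lines[2].split("\t")[2:]
    if len(samples) != n_cols:
        raise ValueError("sample count does not match line 2")

    # Remaining lines: name, description, then one value per sample.
    rows = []
    for ln in lines[3:]:
        fields = ln.split("\t")  # split on tabs only: names may contain spaces
        values = [float(v) for v in fields[2:]]
        if len(values) != n_cols:
            raise ValueError("wrong number of values for " + fields[0])
        rows.append((fields[0], fields[1], values))

    if len(rows) != n_rows:
        raise ValueError("row count does not match line 2")
    return samples, rows
```

Splitting on tabs rather than on arbitrary whitespace matters here, because gene names and descriptions may contain spaces.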

The main difference between the RES and GCT file formats is that the RES format contains labels for each gene's absent (A) versus present (P) calls, as generated by Affymetrix's GeneChip software.


<a name="res">RES File Format</a>
This is a tab delimited file format that is organized as follows:
<img alt="RES format snapshot" src="../../../images/res_format_snapshot.png" />

  1. The first line contains a list of labels identifying the samples associated with each of the columns in the remainder of the file. Two tabs (\t\t) separate the sample identifier labels because each sample contains two data values (an expression value and a present/marginal/absent call).
    • Line format: Description (tab) Accession (tab) (sample 1 name) (tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)
    • For example: Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0
  2. The second line contains a list of sample descriptions. Currently, GSEA ignores these descriptions.
    • Line format: (tab) (sample 1 description) (tab) (tab) (sample 2 description) (tab) (tab) ... (sample N description)
    • For example, our RES file creation tool places the sample data file name and scale factors in this row: MG2000062219AA MG2000062256AA/scale factor=1.2172 ... MG2000062211AA/scale factor=1.1214
  3. The third line contains a number indicating the number of rows in the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.
    • Line format: (# of data rows)
    • For example: 7129
  4. The rest of the data file contains data for each of the genes. There is one row for each gene and two columns for each of the samples. The first two fields in the row contain the description and name for each of the genes (names and descriptions can contain spaces since fields are separated by tabs). The description field is optional but the tab following it is not. Each sample has two pieces of data associated with it: an expression value and an associated Absent/Marginal/Present (A/M/P) call. The A/M/P calls are generated by microarray scanning software (such as Affymetrix's GeneChip software) and are an indication of the confidence in the measured expression value. Currently, GSEA ignores the Absent/Marginal/Present call.
    • Line format: (gene description) (tab) (gene name) (tab) (sample 1 data) (tab) (sample 1 A/M/P call) (tab) (sample 2 data) (tab) (sample 2 A/M/P call) (tab) ... (sample N data) (tab) (sample N A/M/P call)
    • For example: AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104 A -152 A ... -44 A
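
As a rough illustration of the RES layout described above, the following sketch reads the sample names, the row count, and the value/call pairs. The function name `parse_res` and its return shape are hypothetical. The sketch keeps the A/M/P calls even though GSEA currently ignores them.

```python
def parse_res(text):
    """Parse RES-formatted text into (sample_names, rows).

    Hypothetical helper for illustration; each row is returned as
    (gene_name, gene_description, [values], [calls]).
    """
    lines = text.splitlines()

    # Line 1: Description, Accession, then sample names separated by
    # two tabs, so the names sit at every second field from index 2.
    header = lines[0].split("\t")
    samples = [s for s in header[2::2] if s]

    # Line 2 holds sample descriptions, which are skipped here.
    # Line 3: the number of data rows.
    n_rows = int(lines[2].strip())

    rows = []
    for ln in lines[3:]:
        if not ln.strip():
            continue
        fields = ln.split("\t")
        # The description may be empty, but its tab is always present.
        desc, name = fields[0], fields[1]
        data = fields[2:2 + 2 * len(samples)]
        values = [float(v) for v in data[0::2]]  # expression values
        calls = data[1::2]                       # A/M/P calls
        rows.append((name, desc, values, calls))

    if len(rows) != n_rows:
        raise ValueError("row count does not match line 3")
    return samples, rows
```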

<a name="pcl">PCL File Format: Expression datasets</a>
Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. This is a tab delimited file format that is organized as follows:

<img alt="pcl format snapshot" src="../../../images/pcl_format_snapshot.png" />

<a name="cls">CLS File Format: Categorical</a>

The CLS files are text files created to load class information into GSEA. These files use spaces or tabs to separate the fields.

<img alt="cls format snapshot" src="../../../images/cls_format_snapshot.png" />
  1. The first line of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file.
    • Line format: (number of samples) (space) (number of classes) (space) 1
    • For example: 58 2 1
  2. The second line in a CLS file contains names for the class numbers. The line should begin with a pound sign (#) followed by a space.
    • Line format: # (space) (class 0 name) (space) (class 1 name)
    • For example: # cured fatal/ref
  3. The third line contains numeric class labels for each of the samples. The number of class labels should be the same as the number of samples specified in the first line.
    • Line format: (sample 1 class) (space) (sample 2 class) (space) ... (sample N class)
    • For example: 0 0 0 ... 1
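
The three lines of a categorical CLS file can be read and cross-checked as in the sketch below. The function name `parse_cls` is hypothetical, and the sketch assumes that the numeric labels index into the class names in the order they appear on the second line (class 0 first).

```python
def parse_cls(text):
    """Parse a categorical CLS file into (class_names, labels).

    Hypothetical helper for illustration; labels are returned as ints.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]

    # Line 1: (number of samples) (number of classes) 1
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])

    # Line 2: "#" followed by the class names.
    if not lines[1].startswith("#"):
        raise ValueError("second line must begin with '#'")
    class_names = lines[1].lstrip("#").split()
    if len(class_names) != n_classes:
        raise ValueError("class name count does not match line 1")

    # Line 3: one numeric class label per sample.
    labels = [int(x) for x in lines[2].split()]
    if len(labels) != n_samples:
        raise ValueError("label count does not match line 1")
    return class_names, labels


names, labels = parse_cls("4 2 1\n# cured fatal/ref\n0 0 1 1\n")
per_sample = [names[i] for i in labels]  # class name for each sample
```

Because fields are split on any whitespace, the sketch accepts both the space-separated and the tab-separated variants mentioned above.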

<a name="cls2">CLS File Format: Continuous</a>

CLS files can also be used to analyze continuous profiles, such as those from a time series experiment, or to find gene sets that correlate with a gene of interest (gene neighbors).

<img alt="cls numeric snapshot" src="../../../images/cls_numeric_format_snapshot.png" />
<img alt="cls numeric snapshot" src="../../../images/cls_time_series_format_snapshot.png" />

<a name="gmx">Gene set database: GMX File Format</a>

The GMX files contain gene sets in a simple tab-delimited text format.

<img alt="gmx format snapshot" src="../../../images/gmx_format_snapshot.png" />

<a name="gmt">Gene set database: GMT File Format</a>

The GMT files contain gene sets in a simple tab-delimited text format.

<img alt="gmt format snapshot" src="../../../images/gmt_format_snapshot.png" />
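
The text here describes GMX and GMT only as tab-delimited; the exact layout appears in the snapshots. The sketch below therefore rests on an assumption taken from the format names (Gene MatriX and Gene Matrix Transposed): a GMT file holds one gene set per row (name, description, then genes), and a GMX file holds one gene set per column (names in row 1, descriptions in row 2, genes below). The function names are hypothetical.

```python
from itertools import zip_longest


def parse_gmt(text):
    """Read GMT text into {set_name: (description, [genes])}.

    Assumes one gene set per row: name, description, then genes.
    """
    gene_sets = {}
    for ln in text.splitlines():
        if not ln.strip():
            continue
        fields = ln.split("\t")
        genes = [g for g in fields[2:] if g]
        gene_sets[fields[0]] = (fields[1], genes)
    return gene_sets


def gmx_to_gmt(text):
    """Transpose GMX text (one gene set per column) into GMT text."""
    rows = [ln.split("\t") for ln in text.splitlines() if ln.strip()]
    gmt_lines = []
    # Columns can differ in length, so pad the short ones with "".
    for col in zip_longest(*rows, fillvalue=""):
        genes = [g for g in col[2:] if g]
        gmt_lines.append("\t".join([col[0], col[1], *genes]))
    return "\n".join(gmt_lines) + "\n"


gmx = "SET_A\tSET_B\nfirst set\tsecond set\nG1\tG2\n\tG3\n"
gmt = gmx_to_gmt(gmx)
sets = parse_gmt(gmt)
```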

<a name="grp">GRP File Format</a>

The GRP files contain a SINGLE gene set in a simple newline-delimited text format.

<img alt="grp format snapshot" src="../../../images/grp_format_snapshot.png" />

<a name="chip">CHIP File Format</a>

The CHIP file contains annotations for a microarray. It should list the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results.

<img alt="grp format snapshot" src="../../../images/rnk_format_snapshot.png" />

<a name="map">MAP File Format</a>

The MAP file contains annotations that map probe sets between microarrays. This file is not used directly in the GSEA algorithm, but is used to generate gene sets (via chip2chip).

<img alt="grp format snapshot" src="../../../images/rnk_format_snapshot.png" />

<a name="rnk">RNK File Format</a>

The RNK file contains a single, rank-ordered gene list (not a gene set) in a simple newline-delimited text format. Use it when you have a pre-ranked list that you want to analyze with GSEA. For instance, you might have used your favorite t-test-like statistic to produce a rank-ordered gene list from your dataset, which you now want to test for enrichment (note that only gene tag permutations are possible with RNK datasets).

<img alt="rnk format snapshot" src="../../../images/rnk_format_snapshot.png" />
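
A ranked list of this kind could be produced from per-gene statistics as sketched below. The exact column layout is shown only in the snapshot, so the sketch assumes one gene per line with the gene identifier and its score separated by a tab, sorted from highest to lowest score. The function name `write_rnk` and the example scores are hypothetical.

```python
def write_rnk(scores):
    """Return RNK-style text for a {gene: score} mapping.

    Assumed layout: one "gene<TAB>score" line per gene, ordered from
    the highest score to the lowest.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{gene}\t{score}\n" for gene, score in ranked)


# Made-up scores standing in for a t-test-like statistic.
rnk_text = write_rnk({"G1": 0.5, "G2": 2.0, "G3": -1.25})
```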

<a name="mdb">MDB File Format</a>

The MDB files contain an entire gene set database. Unlike the GMT/GMX files, the MDB files are designed to contain rich annotation about a gene set. They are XML formatted. Consult the <a href="../../../doc/msigdb.dtd.txt">MSigDB Document Type Definition</a> for details about the format.

<img alt="msigdb DTD format snapshot" src="../../../images/msigdb_dtd_snapshot.png" />


Example of an MSigDB XML-formatted file

<img alt="msigdb xml format snapshot" src="../../../images/msigdb_xml_snapshot.png" />