Difference between revisions of "Data formats"

From GeneSetEnrichmentAnalysisWiki
Line 1: Line 1:
 
<h1 class="news">Data formats supported by GSEA</h1>
 
<h1 class="news">Data formats supported by GSEA</h1>
<img alt="GSEA input file formats snapshot" src="../../../images/input_file_formats.png" />
+
<img src="../../../images/input_file_formats.png" alt="GSEA input file formats snapshot" />
<p align="right"><a href="../../../resources/datasets_index.html">    <img border="0" src="../../../images/examples.jpg" alt="" /><img border="0" src="../../../images/arrow.png" alt="" /> </a> </p>
+
<p align="right"><a href="../../../resources/datasets_index.html">    <img border="0" alt="" src="../../../images/examples.jpg" /><img border="0" alt="" src="../../../images/arrow.png" /> </a> </p>
 
<h2>Expression data formats</h2>
 
<h2>Expression data formats</h2>
 
<ol>
 
<ol>
Line 30: Line 30:
 
</ol>
 
</ol>
 
<p class="small">        Note: The GCT &amp; RES expression formats supported by GSEA are identical to those supported by GenePattern.        Some description is duplicated here - the GenePattern website has more documentation on file formats.    </p>
 
<p class="small">        Note: The GCT &amp; RES expression formats supported by GSEA are identical to those supported by GenePattern.        Some description is duplicated here - the GenePattern website has more documentation on file formats.    </p>
<hr />  <br /> <br /> <a name="gct"><strong>GCT File Format</strong></a><br /> <br /> The GCT format is a tab delimited file format that is organized as follows<br /> <img alt="GCT format snapshot" src="../../../images/gct_format_snapshot.png" /> <br />
+
<hr />  <br /> <br /> <a name="gct"><strong>GCT File Format</strong></a><br /> <br /> The GCT format is a tab delimited file format that is organized as follows<br /> <img src="../../../images/gct_format_snapshot.png" alt="GCT format snapshot" /> <br />
 
<ol>
 
<ol>
 
     <li>The first line contains the version string and is always the same for this file        format. Therefore, the first line must be as follows:
 
     <li>The first line contains the version string and is always the same for this file        format. Therefore, the first line must be as follows:
Line 56: Line 56:
 
     </li>
 
     </li>
 
</ol>
 
</ol>
The main difference between RES and GCT file formats is the RES file format contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software.<br />  <hr />  <a name="res"><strong>RES File Format</strong></a><br /> This is a tab delimited file format that is organized as follows:<br />  <img alt="RES format snapshot" src="../../../images/res_format_snapshot.png" />
+
The main difference between RES and GCT file formats is the RES file format contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software.<br />  <hr />  <a name="res"><strong>RES File Format</strong></a><br /> This is a tab delimited file format that is organized as follows:<br />  <img src="../../../images/res_format_snapshot.png" alt="RES format snapshot" />
 
<ol>
 
<ol>
 
     <li>The first line contains a list of labels identifying the samples associated with        each of the columns in the remainder of the file. Two tabs (\t\t) separate the        sample identifier labels because each sample contains two data values (an        expression value and a present/marginal/absent call).
 
     <li>The first line contains a list of labels identifying the samples associated with        each of the columns in the remainder of the file. Two tabs (\t\t) separate the        sample identifier labels because each sample contains two data values (an        expression value and a present/marginal/absent call).
Line 83: Line 83:
 
     </li>
 
     </li>
 
</ol>
 
</ol>
<hr />  <a name="pcl"><strong>PCL File Format: Expression datasets</strong></a><br /> Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. This is a tab delimited file format that is organized as follows:<br />  <center><img alt="pcl format snapshot" src="../../../images/pcl_format_snapshot.png" /></center> <hr /> <a name="cls"><strong>CLS File Format: Categorical</strong></a><br /> <br /> The CLS files are text files created to load class information into GSEA. These files use spaces or tabs to separate the fields.<br />  <center><img alt="cls format snapshot" src="../../../images/cls_format_snapshot.png" /></center>
+
<hr />  <a name="pcl"><strong>PCL File Format: Expression datasets</strong></a><br /> Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. This is a tab delimited file format that is organized as follows:<br />  <center><img src="../../../images/pcl_format_snapshot.png" alt="pcl format snapshot" /></center> <hr /> <a name="cls"><strong>CLS File Format: Categorical</strong></a><br /> <br /> The CLS files are text files created to load class information into GSEA. These files use spaces or tabs to separate the fields.<br />  <center><img src="../../../images/cls_format_snapshot.png" alt="cls format snapshot" /></center>
 
<ol>
 
<ol>
 
     <li>The first line of a CLS file contains numbers indicating the number of        samples and number of classes. The number of samples should        correspond to the number of samples in the associated RES or GCT data        file.
 
     <li>The first line of a CLS file contains numbers indicating the number of        samples and number of classes. The number of samples should        correspond to the number of samples in the associated RES or GCT data        file.
Line 103: Line 103:
 
     </ul>
 
     </ul>
 
</ol>
 
</ol>
<hr /> <a name="cls2"><strong>CLS File Format: Continous</strong></a><br /> <br /> CLS files can also be used to analyze continuous profiles such as those from a time series experiment or to find gene sets correlations with a gene of interest (gene neighbors) <br /> <center><img alt="cls numeric snapshot" src="../../../images/cls_numeric_format_snapshot.png" /></center> <center><img alt="cls numeric snapshot" src="../../../images/cls_time_series_format_snapshot.png" /></center> <hr />  <a name="gmx"><strong>Gene set database: GMX File Format</strong></a><br /> <br /> The GMX files contain gene sets in a simple tab-delimited text format.<br /> <center><img alt="gmx format snapshot" src="../../../images/gmx_format_snapshot.png" /></center>  <hr /> <a name="gmt"><strong>Gene set database: GMT File Format</strong></a><br /> <br /> The GMT files contain gene sets in a simple tab-delimited text format.<br /> <center><img alt="gmt format snapshot" src="../../../images/gmt_format_snapshot.png" /></center> <hr />  <a name="grp"><strong>GRP File Format</strong></a><br /> <br /> The GRP files contain a SINGLE gene set in a simple newline-delimited text format.<br /> <center><img alt="grp format snapshot" src="../../../images/grp_format_snapshot.png" /></center> <hr />  <a name="chip"><strong>CHIP File Format</strong></a><br /> <br /> The CHIP file contains annotation about a microarray. It should list the features (i.e probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results. <br /> <center><img alt="grp format snapshot" src="../../../images/rnk_format_snapshot.png" /></center> <hr />  <a name="map"><strong>MAP File Format</strong></a><br /> <br /> The MAP file contains annotations that map probe sets between microarrays. This file is not used directly in the GSEA algorithm, but is used to generate gene sets (via chip2chip). 
<br /> <center><img alt="grp format snapshot" src="../../../images/rnk_format_snapshot.png" /></center> <hr />  <a name="rnk"><strong>RNK File Format</strong></a><br /> <br /> The RNK file contains a single, rank ordered gene list (<em>not</em> gene set) in a simple newline-delimited text format. It is used when you have a pre-ordered ranked list that you want to analyze with GSEA. For instance, you might have used you_favorite_tTest_like_statistic to produce a ranked ordered gene list from your dataset which you now want to test for enrichment (note that only gene tag permutations are possible with rnk datasets). <br /> <center><img alt="rnk format snapshot" src="../../../images/rnk_format_snapshot.png" /></center> <hr /> <a name="mdb"><strong>MDB File Format</strong></a><br /> <br /> The MDB files contain an entire gene set database. Unlike the gmt/gmx files, the MDB files are designed to contain rich annotation about a gene set. They are xml formatted. Consult the <a href="../../../doc/msigdb.dtd.txt">MSigDB Document type    Definition </a> for details about the format<br /> <br /> <center><img alt="msigdb DTD format snapshot" src="../../../images/msigdb_dtd_snapshot.png" /></center>  <br /><strong>Example of an MSigDB xml formatted file</strong><br /> <center><img alt="msigdb xml format snapshot" src="../../../images/msigdb_xml_snapshot.png" /></center>
+
<hr /> <a name="cls2"><strong>CLS File Format: Continous</strong></a><br /> <br /> CLS files can also be used to analyze continuous profiles such as those from a time series experiment or to find gene sets correlations with a gene of interest (gene neighbors) <br /> <center><img src="../../../images/cls_numeric_format_snapshot.png" alt="cls numeric snapshot" /></center> <center><img src="../../../images/cls_time_series_format_snapshot.png" alt="cls numeric snapshot" /></center> <hr />  <a name="gmx"><strong>Gene set database: GMX File Format</strong></a><br /> <br /> The GMX files contain gene sets in a simple tab-delimited text format.<br /> <center><img src="../../../images/gmx_format_snapshot.png" alt="gmx format snapshot" /></center>  <hr /> <a name="gmt"><strong>Gene set database: GMT File Format</strong></a><br /> <br /> The GMT files contain gene sets in a simple tab-delimited text format.<br /> <center><img src="../../../images/gmt_format_snapshot.png" alt="gmt format snapshot" /></center> <hr />  <a name="grp"><strong>GRP File Format</strong></a><br /> <br /> The GRP files contain a SINGLE gene set in a simple newline-delimited text format.<br /> <center><img src="../../../images/grp_format_snapshot.png" alt="grp format snapshot" /></center> <hr />  <a name="chip"><strong>CHIP File Format</strong></a><br /> <br /> The CHIP file contains annotation about a microarray. It should list the features (i.e probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results. <br /> <center><img src="../../../images/rnk_format_snapshot.png" alt="grp format snapshot" /></center> <hr />  <a name="map"><strong>MAP File Format</strong></a><br /> <br /> The MAP file contains annotations that map probe sets between microarrays. This file is not used directly in the GSEA algorithm, but is used to generate gene sets (via chip2chip). 
<br /> <center><img src="../../../images/rnk_format_snapshot.png" alt="grp format snapshot" /></center> <hr />  <a name="rnk"><strong>RNK File Format</strong></a><br /> <br /> The RNK file contains a single, rank ordered gene list (<em>not</em> gene set) in a simple newline-delimited text format. It is used when you have a pre-ordered ranked list that you want to analyze with GSEA. For instance, you might have used you_favorite_tTest_like_statistic to produce a ranked ordered gene list from your dataset which you now want to test for enrichment (note that only gene tag permutations are possible with rnk datasets). <br /> <center><img src="../../../images/rnk_format_snapshot.png" alt="rnk format snapshot" /></center> <hr /> <a name="mdb"><strong>MDB File Format</strong></a><br /> <br /> The MDB files contain an entire gene set database. Unlike the gmt/gmx files, the MDB files are designed to contain rich annotation about a gene set. They are xml formatted. Consult the <a href="../../../doc/msigdb.dtd.txt">MSigDB Document type    Definition </a> for details about the format<br /> <br /> <center><img src="../../../images/msigdb_dtd_snapshot.png" alt="msigdb DTD format snapshot" /></center>  <br /><strong>Example of an MSigDB xml formatted file</strong><br /> <center><img src="../../../images/msigdb_xml_snapshot.png" alt="msigdb xml format snapshot" /></center>

Revision as of 11:52, 24 March 2006

Data formats supported by GSEA

<img src="../../../images/input_file_formats.png" alt="GSEA input file formats snapshot" />

<a href="../../../resources/datasets_index.html"> <img border="0" alt="" src="../../../images/examples.jpg" /><img border="0" alt="" src="../../../images/arrow.png" /> </a>

Expression data formats

  1. <a href="../../../doc/data_formats.html#gct">GCT: Gene Cluster Text file format (*.gct)</a>
  2. <a href="../../../doc/data_formats.html#res">RES: ExpRESsion (with P and A calls) file format (*.res)</a>
  3. <a href="../../../doc/data_formats.html#pcl">PCL: Stanford cDNA file format (*.pcl)</a>

Phenotype data formats

  1. <a href="../../../doc/data_formats.html#cls">CLS: Categorical (e.g., tumor vs. normal) class file format (*.cls)</a>
  2. <a href="../../../doc/data_formats.html#cls2">CLS: Continuous (e.g., time-series or gene profile) file format (*.cls)</a>

Gene set database formats

  1. <a href="../../../doc/data_formats.html#gmx">GMX: Gene MatriX file format (*.gmx)</a>
  2. <a href="../../../doc/data_formats.html#gmt">GMT: Gene Matrix Transposed file format (*.gmt)</a>
  3. <a href="../../../doc/data_formats.html#grp">GRP: Gene set file format (*.grp)</a>
  4. <a href="../../../doc/data_formats.html#mdb">MDB: Molecular signature database file format (*.mdb)</a>

Microarray annotation formats

  1. <a href="../../../doc/data_formats.html#chip">CHIP: Chip file format (*.chip)</a>
  2. <a href="../../../doc/data_formats.html#map">MAP: Chip mapping file format (*.map)</a>

Ranked gene lists

  1. <a href="../../../doc/data_formats.html#rnk">RNK: Ranked list file format (*.rnk)</a>

Note: The GCT & RES expression formats supported by GSEA are identical to those supported by GenePattern. Some description is duplicated here - the GenePattern website has more documentation on file formats.




<a name="gct">GCT File Format</a>

The GCT format is a tab-delimited file format that is organized as follows:
<img src="../../../images/gct_format_snapshot.png" alt="GCT format snapshot" />

  1. The first line contains the version string and is always the same for this file format. Therefore, the first line must be as follows:
    • #1.2
  2. The second line contains numbers indicating the size of the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.
    • Line format: (# of data rows) (tab) (# of data columns)
    • For example: 7129 58
  3. The third line contains a list of identifiers for the samples associated with each of the columns in the remainder of the file.
    • Line format: Name (tab) Description (tab) (sample 1 name) (tab) (sample 2 name) (tab) ... (sample N name)
    • For example: Name Description DLBC1_1 DLBC2_1 ... DLBC58_0
  4. The remainder of the data file contains data for each of the genes. There is one line for each gene and one column for each of the samples. The first two fields in the line contain name and descriptions for the genes (names and descriptions can contain spaces since fields are separated by tabs). The number of lines should agree with the number of data rows specified on line 2.
    • Line format: (gene name) (tab) (gene description) (tab) (col 1 data) (tab) (col 2 data) (tab) ... (col N data)
    • For example: AFFX-BioB-5_at AFFX-BioB-5_at (endogenous control) -104 -152 -158 ... -44
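The four-part layout above can be checked mechanically. The following is a minimal sketch, not the parser used by GSEA itself: the function name `parse_gct` and the tiny two-gene example are made up for illustration, and the sketch assumes every data value is numeric.

```python
import io


def parse_gct(handle):
    """Read a GCT stream and return (sample_names, rows).

    Hypothetical helper for illustration; rows are
    (gene name, gene description, [values]) tuples.
    """
    # Line 1: the version string, always "#1.2".
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unexpected version line: {version!r}")

    # Line 2: (# of data rows) (tab) (# of data columns).
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])

    # Line 3: Name, Description, then one identifier per sample.
    header = handle.readline().rstrip("\r\n").split("\t")
    samples = header[2:]
    if len(samples) != n_cols:
        raise ValueError("sample count does not match line 2")

    # Remaining lines: one gene per line, one value per sample.
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        values = [float(v) for v in fields[2:]]  # assumes numeric data
        if len(values) != n_cols:
            raise ValueError(f"wrong number of values for {fields[0]!r}")
        rows.append((fields[0], fields[1], values))

    if len(rows) != n_rows:
        raise ValueError("row count does not match line 2")
    return samples, rows


# Made-up two-gene, three-sample file.
example = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "AFFX-BioB-5_at\tAFFX-BioB-5_at (endogenous control)\t-104\t-152\t-158\n"
    "gene_b\tsecond gene\t1.5\t2\t3\n"
)
samples, rows = parse_gct(io.StringIO(example))
```

Splitting only on tabs is what lets names and descriptions contain spaces, as noted in item 4.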

The main difference between the RES and GCT file formats is that the RES file format also contains a label for each gene's absent (A) versus present (P) call, as generated by Affymetrix's GeneChip software.


<a name="res">RES File Format</a>
This is a tab delimited file format that is organized as follows:
<img src="../../../images/res_format_snapshot.png" alt="RES format snapshot" />

  1. The first line contains a list of labels identifying the samples associated with each of the columns in the remainder of the file. Two tabs (\t\t) separate the sample identifier labels because each sample contains two data values (an expression value and a present/marginal/absent call).
    • Line format: Description (tab) Accession (tab) (sample 1 name) (tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)
    • For example: Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0
  2. The second line contains a list of sample descriptions. Currently, GSEA ignores these descriptions.
    • Line format: (tab) (sample 1 description) (tab) (tab) (sample 2 description) (tab) (tab) ... (sample N description)
    • For example, our RES file creation tool places the sample data file name and scale factors in this row: MG2000062219AA MG2000062256AA/scale factor=1.2172 ... MG2000062211AA/scale factor=1.1214
  3. The third line contains a number indicating the number of rows in the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.
    • Line format: (# of data rows)
    • For example: 7129
  4. The rest of the data file contains data for each of the genes. There is one row for each gene and two columns for each of the samples. The first two fields in the row contain the description and name for each of the genes (names and descriptions can contain spaces since fields are separated by tabs). The description field is optional but the tab following it is not. Each sample has two pieces of data associated with it: an expression value and an associated Absent/Marginal/Present (A/M/P) call. The A/M/P calls are generated by microarray scanning software (such as Affymetrix's GeneChip software) and are an indication of the confidence in the measured expression value. Currently, GSEA ignores the Absent/Marginal/Present call.
    • Line format: (gene description) (tab) (gene name) (tab) (sample 1 data) (tab) (sample 1 A/P call) (tab) (sample 2 data) (tab) (sample 2 A/P call) (tab) ... (sample N data) (tab) (sample N A/P call)
    • For example: AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104 A -152 A ... -44 A
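As a rough illustration of the layout above, the sketch below reads a RES stream and separates the expression values from the A/M/P calls. The name `parse_res` and the one-gene example are hypothetical, and the sketch assumes every sample has both a value and a call on each row.

```python
import io


def parse_res(handle):
    """Read a RES stream and return (sample_names, rows).

    Hypothetical helper for illustration; rows are
    (gene name, gene description, [values], [calls]) tuples.
    """
    # Line 1: Description, Accession, then sample names separated by
    # two tabs, so names sit at every second field from index 2.
    header = handle.readline().rstrip("\r\n").split("\t")
    samples = [name for name in header[2::2] if name]

    # Line 2: sample descriptions, ignored here.
    handle.readline()

    # Line 3: the number of data rows.
    n_rows = int(handle.readline().split()[0])

    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        description, name = fields[0], fields[1]
        data = fields[2:]
        # Values and calls alternate: value, call, value, call, ...
        values = [float(v) for v in data[0::2][: len(samples)]]
        calls = data[1::2][: len(samples)]
        rows.append((name, description, values, calls))

    if len(rows) != n_rows:
        raise ValueError("row count does not match line 3")
    return samples, rows


# Made-up one-gene, two-sample file.
example = (
    "Description\tAccession\tS1\t\tS2\n"
    "\tfile1\t\tfile2\n"
    "1\n"
    "AFFX-BioB-5_at (endogenous control)\tAFFX-BioB-5_at\t-104\tA\t-152\tA\n"
)
samples, rows = parse_res(io.StringIO(example))
```

Note that, unlike GCT, the description comes before the name on each data row.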

<a name="pcl">PCL File Format: Expression datasets</a>
Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. This is a tab delimited file format that is organized as follows:

<img src="../../../images/pcl_format_snapshot.png" alt="pcl format snapshot" />

<a name="cls">CLS File Format: Categorical</a>

The CLS files are text files created to load class information into GSEA. These files use spaces or tabs to separate the fields.

<img src="../../../images/cls_format_snapshot.png" alt="cls format snapshot" />
  1. The first line of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file.
    • Line format: (number of samples) (space) (number of classes) (space) 1
    • For example: 58 2 1
  2. The second line in a CLS file contains names for the class numbers. The line should begin with a pound sign (#) followed by a space.
    • Line format: # (space) (class 0 name) (space) (class 1 name)
    • For example: # cured fatal/ref
  3. The third line contains numeric class labels for each of the samples. The number of class labels should be the same as the number of samples specified in the first line.
    • Line format: (sample 1 class) (space) (sample 2 class) (space) ... (sample N class)
    • For example: 0 0 0 ... 1
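The three lines described above can be read and cross-checked as in the following sketch. The name `parse_cls` and the six-sample example are made up for illustration; the sketch assumes class labels are the integers 0, 1, ... in the order the class names are listed.

```python
def parse_cls(text):
    """Read a categorical CLS file and return (class_names, labels).

    Hypothetical helper for illustration.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Line 1: (number of samples) (number of classes) 1
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])

    # Line 2: "#" followed by the class names.
    if not lines[1].startswith("#"):
        raise ValueError("second line must begin with '#'")
    class_names = lines[1][1:].split()
    if len(class_names) != n_classes:
        raise ValueError("class name count does not match line 1")

    # Line 3: one numeric class label per sample.
    labels = [int(x) for x in lines[2].split()]
    if len(labels) != n_samples:
        raise ValueError("label count does not match line 1")
    return class_names, labels


# Made-up file with six samples in two classes.
example = "6 2 1\n# cured fatal/ref\n0 0 0 1 1 1\n"
class_names, labels = parse_cls(example)

# Name of the class each sample belongs to, in sample order.
per_sample = [class_names[label] for label in labels]
```

Because fields may be separated by spaces or tabs, the sketch splits on any whitespace.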

<a name="cls2">CLS File Format: Continuous</a>

CLS files can also be used to analyze continuous profiles, such as those from a time series experiment, or to find gene sets that correlate with a gene of interest (gene neighbors).

<img src="../../../images/cls_numeric_format_snapshot.png" alt="cls numeric snapshot" />
<img src="../../../images/cls_time_series_format_snapshot.png" alt="cls numeric snapshot" />

<a name="gmx">Gene set database: GMX File Format</a>

The GMX files contain gene sets in a simple tab-delimited text format.

<img src="../../../images/gmx_format_snapshot.png" alt="gmx format snapshot" />

<a name="gmt">Gene set database: GMT File Format</a>

The GMT files contain gene sets in a simple tab-delimited text format.

<img src="../../../images/gmt_format_snapshot.png" alt="gmt format snapshot" />
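Since the text describes the layout only through the snapshot, the sketch below assumes the usual GMT arrangement: one gene set per line, with the set name first, a description second, and the member genes after that, all tab-separated. The name `parse_gmt` and the gene set names are made up for illustration.

```python
def parse_gmt(text):
    """Read GMT text into {set name: (description, [genes])}.

    Hypothetical helper; assumes one gene set per line laid out as
    name, description, gene, gene, ... separated by tabs.
    """
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line needs a name and a description: {line!r}")
        name, description = fields[0], fields[1]
        # Drop empty fields left by trailing tabs.
        genes = [gene for gene in fields[2:] if gene]
        gene_sets[name] = (description, genes)
    return gene_sets


# Made-up database with two gene sets of different sizes.
example = (
    "SET_A\tfirst example set\tTP53\tBRCA1\tMYC\n"
    "SET_B\tsecond example set\tEGFR\tKRAS\n"
)
gene_sets = parse_gmt(example)
```

In this layout each set occupies a row, so sets of different sizes need no padding; GMX stores the same information with one set per column.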

<a name="grp">GRP File Format</a>

The GRP files contain a SINGLE gene set in a simple newline-delimited text format.

<img src="../../../images/grp_format_snapshot.png" alt="grp format snapshot" />
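A newline-delimited gene set is simple enough to read in a few lines. In the sketch below the name `parse_grp` is hypothetical, and treating lines that start with `#` as comments is an assumption rather than something stated in the text above.

```python
def parse_grp(text):
    """Read GRP text into a list of gene names, one per line.

    Hypothetical helper; skipping '#' lines is an assumption.
    """
    genes = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        genes.append(entry)
    return genes


# Made-up gene set with a leading comment line.
example = "# example gene set\nTP53\nBRCA1\n\nMYC\n"
genes = parse_grp(example)
```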

<a name="chip">CHIP File Format</a>

The CHIP file contains annotation about a microarray. It should list the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results.

<img src="../../../images/rnk_format_snapshot.png" alt="grp format snapshot" />

<a name="map">MAP File Format</a>

The MAP file contains annotations that map probe sets between microarrays. This file is not used directly in the GSEA algorithm, but is used to generate gene sets (via chip2chip).

<img src="../../../images/rnk_format_snapshot.png" alt="grp format snapshot" />

<a name="rnk">RNK File Format</a>

The RNK file contains a single, rank ordered gene list (not gene set) in a simple newline-delimited text format. It is used when you have a pre-ordered ranked list that you want to analyze with GSEA. For instance, you might have used your_favorite_tTest_like_statistic to produce a rank ordered gene list from your dataset, which you now want to test for enrichment (note that only gene tag permutations are possible with rnk datasets).

<img src="../../../images/rnk_format_snapshot.png" alt="rnk format snapshot" />
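One way to produce such a list from per-gene statistics is sketched below. It assumes each line holds a gene name and its score separated by a tab, which the text above does not spell out; the name `format_rnk` and the scores are made up for illustration.

```python
def format_rnk(scores):
    """Return RNK text for a {gene: score} mapping, highest score first.

    Hypothetical helper; assumes one "gene<TAB>score" pair per line.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{gene}\t{score}\n" for gene, score in ranked)


# Made-up scores from some t-test-like statistic.
scores = {"GENE_A": 1.0, "GENE_B": 2.5, "GENE_C": -0.3}
rnk_text = format_rnk(scores)
```

Sorting in descending order puts the genes most associated with the first phenotype at the top of the list.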

<a name="mdb">MDB File Format</a>

The MDB files contain an entire gene set database. Unlike the gmt/gmx files, the MDB files are designed to contain rich annotation about a gene set. They are XML formatted. Consult the <a href="../../../doc/msigdb.dtd.txt">MSigDB Document Type Definition</a> for details about the format.

<img src="../../../images/msigdb_dtd_snapshot.png" alt="msigdb DTD format snapshot" />


Example of an MSigDB XML formatted file

<img src="../../../images/msigdb_xml_snapshot.png" alt="msigdb xml format snapshot" />