Data formats
Contents
Data formats supported by GSEA
<img alt="GSEA input file formats snapshot" src="../../../images/input_file_formats.png" />
You can download example files from [../resources/datasets_index.html here].
Expression data formats
- <a href="../../../doc/data_formats.html#gct">GCT: Gene Cluster Text file format (*.gct)</a>
- <a href="../../../doc/data_formats.html#res">RES: ExpRESsion (with P and A calls) file format (*.res)</a>
- <a href="../../../doc/data_formats.html#pcl">PCL: Stanford cDNA file format (*.pcl)</a>
Phenotype data formats
- <a href="../../../doc/data_formats.html#cls">CLS: Categorical (e.g., tumor vs normal) class file format (*.cls)</a>
- <a href="../../../doc/data_formats.html#cls2">CLS: Continuous (e.g., time-series or gene profile) file format (*.cls)</a>
Gene set database formats
- <a href="../../../doc/data_formats.html#gmx">GMX: Gene MatriX file format (*.gmx)</a>
- <a href="../../../doc/data_formats.html#gmt">GMT: Gene Matrix Transposed file format (*.gmt)</a>
- <a href="../../../doc/data_formats.html#grp">GRP: Gene set file format (*.grp)</a>
- <a href="../../../doc/data_formats.html#mdb">MDB: Molecular signature database file format (*.mdb)</a>
Microarray annotation formats
- <a href="../../../doc/data_formats.html#chip">CHIP: Chip file format (*.chip)</a>
- <a href="../../../doc/data_formats.html#map">MAP: Chip mapping file format (*.map)</a>
Ranked gene lists
- <a href="../../../doc/data_formats.html#rnk">RNK: Ranked list file format (*.rnk)</a>
Note: The GCT & RES expression formats supported by GSEA are identical to those supported by GenePattern. Some description is duplicated here - the GenePattern website has more documentation on file formats.
<a name="gct">GCT File Format</a>
The GCT format is a tab-delimited file format that is organized as follows:
<img alt="GCT format snapshot" src="../../../images/gct_format_snapshot.png" />
- The first line contains the version string and is always the same for this file format. Therefore, the first line must be as follows:
- #1.2
- The second line contains numbers indicating the size of the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.
- Line format: (# of data rows) (tab) (# of data columns)
- For example: 7129 58
- The third line contains a list of identifiers for the samples associated with each of the columns in the remainder of the file.
- Line format: Name (tab) Description (tab) (sample 1 name) (tab) (sample 2 name) (tab) ... (sample N name)
- For example: Name Description DLBC1_1 DLBC2_1 ... DLBC58_0
- The remainder of the data file contains data for each of the genes. There is one line for each gene and one column for each of the samples. The first two fields in the line contain name and descriptions for the genes (names and descriptions can contain spaces since fields are separated by tabs). The number of lines should agree with the number of data rows specified on line 2.
- Line format: (gene name) (tab) (gene description) (tab) (col 1 data) (tab) (col 2 data) (tab) ... (col N data)
- For example: AFFX-BioB-5_at AFFX-BioB-5_at (endogenous control) -104 -152 -158 ... -44
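The GCT layout described above can be checked with a short parser. The following is a minimal sketch, not the reader GSEA itself uses; the function name `parse_gct`, the sample names `S1` to `S3`, and the second gene row are hypothetical and exist only for illustration.

```python
def parse_gct(text):
    """Parse GCT text into (sample_names, rows).

    Each row is a (gene_name, gene_description, values) tuple.
    """
    lines = text.splitlines()
    # Line 1: fixed version string.
    if lines[0].strip() != "#1.2":
        raise ValueError("line 1 must be the version string #1.2")
    # Line 2: number of data rows, then number of data columns.
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])
    # Line 3: Name, Description, then one identifier per sample.
    samples = lines[2].split("\t")[2:]
    if len(samples) != n_cols:
        raise ValueError("sample count does not match line 2")
    rows = []
    for line in lines[3:]:
        if not line.strip():
            continue  # tolerate blank trailing lines
        fields = line.split("\t")  # split on tabs only: names may contain spaces
        values = [float(v) for v in fields[2:]]
        if len(values) != n_cols:
            raise ValueError("wrong number of values for " + fields[0])
        rows.append((fields[0], fields[1], values))
    if len(rows) != n_rows:
        raise ValueError("row count does not match line 2")
    return samples, rows


# Hypothetical two-gene, three-sample file.
example_gct = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "AFFX-BioB-5_at\tAFFX-BioB-5_at (endogenous control)\t-104\t-152\t-158\n"
    "geneB\tsecond gene\t1.5\t2\t3\n"
)
samples, rows = parse_gct(example_gct)
```

Splitting data lines on tabs only, rather than on any whitespace, keeps gene descriptions that contain spaces intact.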
The main difference between the RES and GCT file formats is that the RES format contains labels for each gene's absent (A) versus present (P) calls, as generated by Affymetrix's GeneChip software.
<a name="res">RES File Format</a>
This is a tab delimited file format that is organized as follows:
<img alt="RES format snapshot" src="../../../images/res_format_snapshot.png" />
- The first line contains a list of labels identifying the samples associated with each of the columns in the remainder of the file. Two tabs (\t\t) separate the sample identifier labels because each sample contains two data values (an expression value and a present/marginal/absent call).
- Line format: Description (tab) Accession (tab) (sample 1 name) (tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)
- For example: Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0
- The second line contains a list of sample descriptions. Currently, GSEA ignores these descriptions.
- Line format: (tab) (sample 1 description) (tab) (tab) (sample 2 description) (tab) (tab) ... (sample N description)
- For example, our RES file creation tool places the sample data file name and scale factors in this row: MG2000062219AA MG2000062256AA/scale factor=1.2172 ... MG2000062211AA/scale factor=1.1214
- The third line contains a number indicating the number of rows in the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.
- Line format: (# of data rows)
- For example: 7129
- The rest of the data file contains data for each of the genes. There is one row for each gene and two columns for each of the samples. The first two fields in the row contain the description and name for each of the genes (names and descriptions can contain spaces since fields are separated by tabs). The description field is optional but the tab following it is not. Each sample has two pieces of data associated with it: an expression value and an associated Absent/Marginal/Present (A/M/P) call. The A/M/P calls are generated by microarray scanning software (such as Affymetrix's GeneChip software) and are an indication of the confidence in the measured expression value. Currently, GSEA ignores the Absent/Marginal/Present call.
- Line format: (gene description) (tab) (gene name) (tab) (sample 1 data) (tab) (sample 1 A/P call) (tab) (sample 2 data) (tab) (sample 2 A/P call) (tab) ... (sample N data) (tab) (sample N A/P call)
- For example: AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104 A -152 A ... -44 A
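As a sketch of how the paired value/call columns line up, the following reads a RES file laid out as described above. The function name `parse_res`, the sample names `S1` and `S2`, and the second gene row are hypothetical. The calls are returned for completeness even though GSEA ignores them.

```python
def parse_res(text):
    """Parse RES text into (sample_names, rows).

    Each row is a (gene_name, gene_description, values, calls) tuple.
    """
    lines = text.splitlines()
    # Line 1: Description, Accession, then sample names separated by two tabs,
    # so the names sit at every second field starting from index 2.
    samples = lines[0].split("\t")[2::2]
    # Line 2 holds sample descriptions and is skipped here.
    # Line 3: number of data rows.
    n_rows = int(lines[2].strip())
    rows = []
    for line in lines[3:]:
        if not line.strip():
            continue
        fields = line.split("\t")
        description, name = fields[0], fields[1]  # description may be empty
        values = [float(v) for v in fields[2::2]]  # expression values
        calls = fields[3::2]                       # A/M/P calls
        if len(values) != len(samples) or len(calls) != len(samples):
            raise ValueError("wrong number of value/call pairs for " + name)
        rows.append((name, description, values, calls))
    if len(rows) != n_rows:
        raise ValueError("row count does not match line 3")
    return samples, rows


# Hypothetical two-gene, two-sample file; the second row has an empty description.
example_res = (
    "Description\tAccession\tS1\t\tS2\n"
    "\tS1 scan\t\tS2 scan\n"
    "2\n"
    "AFFX-BioB-5_at (endogenous control)\tAFFX-BioB-5_at\t-104\tA\t-152\tA\n"
    "\tgeneB\t10\tP\t20\tM\n"
)
res_samples, res_rows = parse_res(example_res)
```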
<a name="pcl">PCL File Format: Expression datasets</a>
Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. This is a tab delimited file format that is organized as follows:
<a name="cls">CLS File Format: Categorical</a>
The CLS files are text files created to load class information into GSEA. These files use spaces or tabs to separate the fields.
- The first line of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file.
- Line format: (number of samples) (space) (number of classes) (space) 1
- For example: 58 2 1
- The second line in a CLS file contains names for the class numbers. The line should begin with a pound sign (#) followed by a space.
- Line format: # (space) (class 0 name) (space) (class 1 name)
- For example: # cured fatal/ref
- The third line contains numeric class labels for each of the samples. The number of class labels should be the same as the number of samples specified in the first line.
- Line format: (sample 1 class) (space) (sample 2 class) (space) ... (sample N class)
- For example: 0 0 0 ... 1
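The three lines of a categorical CLS file can be read and cross-checked as follows. This is a minimal sketch under the layout described above; the function name `parse_cls` and the six-sample example are hypothetical.

```python
def parse_cls(text):
    """Parse a categorical CLS file into (class_names, labels)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Line 1: number of samples, number of classes, then the constant 1.
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    # Line 2: '#' followed by the class names, in class-number order.
    if not lines[1].startswith("#"):
        raise ValueError("line 2 must begin with #")
    class_names = lines[1][1:].split()
    if len(class_names) != n_classes:
        raise ValueError("class name count does not match line 1")
    # Line 3: one numeric class label per sample.
    labels = [int(x) for x in lines[2].split()]
    if len(labels) != n_samples:
        raise ValueError("label count does not match line 1")
    if any(label < 0 or label >= n_classes for label in labels):
        raise ValueError("label outside the declared class range")
    return class_names, labels


# Hypothetical six-sample file using the class names from the example above.
example_cls = "6 2 1\n# cured fatal/ref\n0 0 0 1 1 1\n"
class_names, labels = parse_cls(example_cls)
# Map each sample's numeric label back to its class name.
per_sample = [class_names[label] for label in labels]
```

Because fields may be separated by spaces or tabs, the sketch splits on any whitespace.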
<a name="cls2">CLS File Format: Continuous</a>
CLS files can also be used to analyze continuous profiles, such as those from a time-series experiment, or to find gene sets that correlate with a gene of interest (gene neighbors).
<a name="gmx">Gene set database: GMX File Format</a>
The GMX files contain gene sets in a simple tab-delimited text format.
<a name="gmt">Gene set database: GMT File Format</a>
The GMT files contain gene sets in a simple tab-delimited text format.
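The text above does not spell out the row layout. The sketch below assumes the commonly documented GMT arrangement, in which each row is one gene set: a set name, a description, then the member genes, all tab-separated. That layout, the function name `parse_gmt`, and the set and gene names are assumptions made for illustration.

```python
def parse_gmt(text):
    """Parse GMT text into {set_name: (description, genes)}.

    Assumes one gene set per row: name, description, then genes.
    """
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("each row needs a name and a description")
        name, description = fields[0], fields[1]
        # Drop empty fields left behind by trailing tabs.
        genes = [g for g in fields[2:] if g]
        gene_sets[name] = (description, genes)
    return gene_sets


# Hypothetical two-set database.
example_gmt = "SET_A\tna\tGENE1\tGENE2\nSET_B\tna\tGENE2\tGENE3\tGENE4\n"
gene_sets = parse_gmt(example_gmt)
```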
<a name="grp">GRP File Format</a>
The GRP files contain a SINGLE gene set in a simple newline-delimited text format.
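A newline-delimited gene set takes only a few lines to read. In this sketch the function name `read_grp` and the gene names are hypothetical, and treating lines that begin with `#` as comments is an assumption rather than something stated above.

```python
def read_grp(text):
    """Return the genes of a single gene set, one per line."""
    genes = []
    for line in text.splitlines():
        gene = line.strip()
        # Skip blank lines and (by assumption) '#' comment lines.
        if not gene or gene.startswith("#"):
            continue
        genes.append(gene)
    return genes


# Hypothetical three-gene set.
example_grp = "# my gene set\nGENE1\nGENE2\n\nGENE3\n"
grp_genes = read_grp(example_grp)
```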
<a name="chip">CHIP File Format</a>
The CHIP file contains annotation about a microarray. It should list the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results.
<a name="map">MAP File Format</a>
The MAP file contains annotations that map probe sets between microarrays. This file is not used directly in the GSEA algorithm, but is used to generate gene sets (via chip2chip).
<a name="rnk">RNK File Format</a>
The RNK file contains a single, rank-ordered gene list (not a gene set) in a simple newline-delimited text format. It is used when you have a pre-ordered ranked list that you want to analyze with GSEA. For instance, you might have used your_favorite_tTest_like_statistic to produce a rank-ordered gene list from your dataset, which you now want to test for enrichment (note that only gene tag permutations are possible with RNK datasets).
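A ranked list can be produced from per-gene scores as sketched below. The sketch assumes each line holds a gene identifier and its score separated by a tab, with the highest score first; that two-column layout, the function name `to_rnk`, and the gene names and scores are assumptions made for illustration.

```python
def to_rnk(scores):
    """Format {gene: score} as RNK text, highest score first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # One gene per line: identifier, tab, score (assumed layout).
    return "".join(f"{gene}\t{score}\n" for gene, score in ordered)


# Hypothetical scores from some t-test-like statistic.
example_scores = {"GENE_A": 2.5, "GENE_B": -1.0, "GENE_C": 0.3}
rnk_text = to_rnk(example_scores)
```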
<a name="mdb">MDB File Format</a>
The MDB files contain an entire gene set database. Unlike the GMT/GMX files, the MDB files are designed to contain rich annotation about a gene set. They are XML-formatted. Consult the <a href="../../../doc/msigdb.dtd.txt">MSigDB Document Type Definition</a> for details about the format.
Example of an MSigDB XML-formatted file