Data formats

From GeneSetEnrichmentAnalysisWiki
Revision as of 09:13, 24 April 2006 by Hkuehn (talk | contribs)

Each GSEA supported file is an ASCII text file with a specific format. This page describes the format of each of the GSEA files.

For examples of each supported file, see the sample files at http://www.broad.mit.edu/cancer/software/gsea_beta/resources/datasets_index.html. Alternatively, from within the GSEA application, select Help>Show GSEA Home Folder and open the examples subfolder.

Expression Data Formats

Note: The GCT & RES expression formats supported by GSEA are identical to those supported by GenePattern.

GCT: Gene Cluster Text file format (*.gct)

The GCT format is a tab delimited file format that describes an expression dataset. It is organized as follows:

Gct format snapshot.gif

The first line contains the version string and is always the same for this file format. Therefore, the first line must be as follows:

#1.2

The second line contains numbers indicating the size of the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.

Line format:        (# of data rows) (tab) (# of data columns)
Example:            7129 58

The third line contains a list of identifiers for the samples associated with each of the columns in the remainder of the file.

Line format:        Name(tab)Description(tab)(sample 1 name)(tab)(sample 2 name) (tab) ... (sample N name)
Example:            Name Description DLBC1_1 DLBC2_1 ... DLBC58_0

The remainder of the data file contains data for each of the genes. There is one line for each gene and one column for each of the samples. The first two fields in each line contain the name and description of the gene (names and descriptions can contain spaces since fields are separated by tabs). The number of lines should agree with the number of data rows specified on line 2.

Line format:        (gene name) (tab) (gene description) (tab) (col 1 data) (tab) (col 2 data) (tab) ... (col N data)
Example:            AFFX-BioB-5_at AFFX-BioB-5_at (endogenous control) -104 -152 -158 ... -44
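The GCT layout described above can be read with a short script. The following is a minimal sketch in Python, not part of GSEA: the function name read_gct is hypothetical, the second data row of the sample input is made up for illustration, and skipping blank lines is an assumption rather than something the format description states.

```python
import io


def read_gct(handle):
    """Parse a GCT text stream into (sample_names, rows).

    Each row is a (gene_name, gene_description, values) tuple.
    """
    lines = [line.rstrip("\r\n") for line in handle]
    if lines[0].strip() != "#1.2":
        raise ValueError("line 1 must be the version string #1.2")

    # Line 2: (# of data rows) (tab) (# of data columns)
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])

    # Line 3: Name, Description, then one identifier per sample.
    samples = lines[2].split("\t")[2:]
    if len(samples) != n_cols:
        raise ValueError("sample count does not match line 2")

    rows = []
    for line in lines[3:]:
        if not line.strip():
            continue  # assumption: tolerate blank lines
        fields = line.split("\t")
        values = [float(v) for v in fields[2:]]
        if len(values) != n_cols:
            raise ValueError(f"wrong number of data columns for {fields[0]!r}")
        rows.append((fields[0], fields[1], values))

    # The number of gene lines should agree with line 2.
    if len(rows) != n_rows:
        raise ValueError("row count does not match line 2")
    return samples, rows


# Tiny illustrative input; GENE_2 and its values are invented.
example = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tDLBC1_1\tDLBC2_1\tDLBC3_1\n"
    "AFFX-BioB-5_at\tAFFX-BioB-5_at (endogenous control)\t-104\t-152\t-158\n"
    "GENE_2\thypothetical second gene\t1.5\t2\t3\n"
)
samples, rows = read_gct(io.StringIO(example))
print(samples)
print(rows[0])
```

Splitting gene lines on tabs only, rather than on any whitespace, keeps names and descriptions that contain spaces intact.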

RES: ExpRESsion (with P and A calls) file format (*.res)

The RES file format is a tab delimited file format that describes an expression dataset. The main difference between the RES and GCT file formats is that the RES file format contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software. The RES file format is organized as follows:

Res format snapshot.gif

The first line contains a list of labels identifying the samples associated with each of the columns in the remainder of the file. Two tabs (\t\t) separate the sample identifier labels because each sample contains two data values (an expression value and a present/marginal/absent call).

Line format:      Description (tab) Accession (tab) (sample 1 name) (tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)

For example:    Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0

The second line contains a list of sample descriptions. Currently, GSEA ignores these descriptions. Our RES file creation tool places the sample data file name and scale factors in this row, as shown below.

Line format:      (tab) (sample 1 description) (tab) (tab) (sample 2 description) (tab) (tab) ... (sample N description)

Example:          MG2000062219AA MG2000062256AA/scale factor=1.2172 ... MG2000062211AA/scale factor=1.1214

The third line contains a number indicating the number of rows in the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.

Line format:      (# of data rows)

For example:    7129

The remainder of the data file contains data for each of the genes. There is one row for each gene and two columns for each of the samples. The first two fields in the row contain the description and name for each of the genes (names and descriptions can contain spaces since fields are separated by tabs). The description field is optional but the tab following it is not. Each sample has two pieces of data associated with it: an expression value and an associated Absent/Marginal/Present (A/M/P) call. The A/M/P calls are generated by microarray scanning software (such as Affymetrix's GeneChip software) and are an indication of the confidence in the measured expression value. Currently, GSEA ignores the Absent/Marginal/Present call.

Line format:      (gene description) (tab) (gene name) (tab) (sample 1 data) (tab) (sample 1 A/P call) (tab) (sample 2 data) (tab) (sample 2 A/P call) (tab) ... (sample N data) (tab) (sample N A/P call)

For example:    AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104 A -152 A ... -44 A
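As an illustration of the paired value/call columns, here is a minimal Python sketch of a RES reader. It is not GSEA code: the name read_res and the sample input are hypothetical, and the calls are returned only for completeness, since GSEA itself ignores them.

```python
import io


def read_res(handle):
    """Parse a RES text stream into (sample_names, rows).

    Each row is a (gene_name, gene_description, values, calls) tuple.
    """
    lines = [line.rstrip("\r\n") for line in handle]

    # Line 1: Description, Accession, then sample names separated by
    # two tabs, so the names sit at every second field from index 2.
    samples = lines[0].split("\t")[2::2]

    # Line 2 holds sample descriptions, which are ignored here.
    # Line 3: number of data rows.
    n_rows = int(lines[2].strip())

    rows = []
    for line in lines[3:]:
        if not line.strip():
            continue  # assumption: tolerate blank lines
        fields = line.split("\t")
        description, name = fields[0], fields[1]
        # Two fields per sample: expression value, then A/M/P call.
        data = fields[2:2 + 2 * len(samples)]
        if len(data) != 2 * len(samples):
            raise ValueError(f"expected a value and a call per sample for {name!r}")
        values = [float(v) for v in data[0::2]]
        calls = data[1::2]
        rows.append((name, description, values, calls))

    if len(rows) != n_rows:
        raise ValueError("row count does not match line 3")
    return samples, rows


# Tiny illustrative input with two samples and one gene.
example = (
    "Description\tAccession\tDLBC1_1\t\tDLBC2_1\n"
    "\tMG2000062219AA\t\tMG2000062256AA/scale factor=1.2172\n"
    "1\n"
    "AFFX-BioB-5_at (endogenous control)\tAFFX-BioB-5_at\t-104\tA\t-152\tA\n"
)
samples, rows = read_res(io.StringIO(example))
print(samples)
print(rows[0])
```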

PCL: Stanford cDNA file format (*.pcl)

The PCL file format is a tab delimited file format that describes an expression dataset. Support for this format is provided because several Stanford cDNA datasets are available in the PCL format. For more information, see Stanford pcl file format. The PCL file format is organized as follows:

Pcl format snapshot.gif

Phenotype Data Formats

CLS: Categorical (e.g., tumor vs normal) class file format (*.cls)

The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. The CLS file format uses spaces or tabs to separate the fields.

The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes. Categorical labels define discrete phenotypes (for example, normal vs tumor). For categorical labels, the CLS file format is organized as follows:

Cls format snapshot.png

The first line of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file.

Line format:      (number of samples) (space) (number of classes) (space) 1

Example:          58 2 1

The second line in a CLS file contains a name for each class. These are the class names that appear in analysis reports. The line should begin with a pound sign (#) followed by a space.

Line format:      # (space) (class 0 name) (space) (class 1 name)

Example:    # cured fatal/ref

The third line contains a class label for each sample. The label for a class can be the class name, a number, or a text string. The first label used is assigned to the first class named on the second line; the second unique label is assigned to the second class named; and so on. The number of class labels specified on this line should be the same as the number of samples specified in the first line. The number of unique class labels specified on this line should be the same as the number of classes specified in the first line.

Line format:      (sample 1 class) (space) (sample 2 class) (space) ... (sample N class)

Example:    0 0 0 ... 1 1
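The rule that labels are matched to class names in order of first appearance is easy to get wrong, so a small sketch may help. The following Python function is a hypothetical illustration (read_cls is not a GSEA function) that reads a categorical CLS file and applies the consistency checks described above.

```python
import io


def read_cls(handle):
    """Parse a categorical CLS stream into (class_names, sample_classes)."""
    lines = [line.strip() for line in handle if line.strip()]

    # Line 1: (number of samples) (number of classes) 1
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])

    # Line 2: a pound sign followed by the class names.
    if not lines[1].startswith("#"):
        raise ValueError("line 2 must begin with a pound sign (#)")
    class_names = lines[1][1:].split()
    if len(class_names) != n_classes:
        raise ValueError("class name count does not match line 1")

    # Line 3: one label per sample, separated by spaces or tabs.
    labels = lines[2].split()
    if len(labels) != n_samples:
        raise ValueError("label count does not match line 1")

    # The first label used maps to the first class name, the second
    # unique label to the second class name, and so on.
    label_to_class = {}
    for label in labels:
        if label not in label_to_class:
            if len(label_to_class) == n_classes:
                raise ValueError("more unique labels than classes")
            label_to_class[label] = class_names[len(label_to_class)]
    if len(label_to_class) != n_classes:
        raise ValueError("fewer unique labels than classes")

    return class_names, [label_to_class[label] for label in labels]


example = "6 2 1\n# cured fatal/ref\n0 0 0 1 1 1\n"
class_names, sample_classes = read_cls(io.StringIO(example))
print(class_names)
print(sample_classes)
```

Note that the mapping depends on order of appearance, not on the label values: with the labels `1 1 0`, the label 1 would be assigned to the first class named on the second line.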

CLS: Continuous (e.g., time-series or gene profile) file format (*.cls)

The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. The CLS file format uses spaces or tabs to separate the fields.

The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes. Continuous phenotypes are used for time series experiments or to find gene sets that correlate with a gene of interest (gene neighbors). For continuous labels, the CLS file format is organized as follows:

Cls numeric format snapshot.gif

Cls time series format snapshot.gif

Gene Set Database Formats

Note: Typically, you use the GMX or GMT formats to define gene sets.

GMX: Gene MatriX file format (*.gmx)

The GMX file format is a tab delimited file format that describes gene sets. In the GMX format, each column represents a gene set; in the GMT format, each row represents a gene set. The GMX file format is organized as follows:

Gmx format snapshot.gif

Each gene set is described by a name, a description, and the genes in the gene set. GSEA uses the description field to determine what hyperlink to provide in the report for the gene set description: if the description is “na”, GSEA provides a link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.
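To make the column-per-gene-set layout concrete, here is a minimal Python sketch of a GMX reader. It is an illustration only: the name read_gmx is hypothetical, and the code assumes that the first row holds the gene set names, the second row holds the descriptions, and the remaining rows hold the genes, with empty cells where one gene set is shorter than another.

```python
import io


def read_gmx(handle):
    """Parse a GMX stream into {set_name: (description, genes)}."""
    rows = [line.rstrip("\r\n").split("\t") for line in handle if line.strip()]
    names, descriptions = rows[0], rows[1]

    gene_sets = {}
    for col, name in enumerate(names):
        if not name:
            continue  # skip empty header cells
        description = descriptions[col] if col < len(descriptions) else ""
        # Walk down the column, skipping the padding cells of shorter sets.
        genes = [
            row[col]
            for row in rows[2:]
            if col < len(row) and row[col].strip()
        ]
        gene_sets[name] = (description, genes)
    return gene_sets


# Two made-up gene sets of different lengths.
example = (
    "SET_A\tSET_B\n"
    "na\thttp://example.org/set_b\n"
    "TP53\tBRCA1\n"
    "MYC\t\n"
)
gene_sets = read_gmx(io.StringIO(example))
print(gene_sets["SET_A"])
print(gene_sets["SET_B"])
```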

GMT: Gene Matrix Transposed file format (*.gmt)

The GMT file format is a tab delimited file format that describes gene sets. In the GMT format, each row represents a gene set; in the GMX format, each column represents a gene set. The GMT file format is organized as follows:

Gmt format snapshot.gif

Each gene set is described by a name, a description, and the genes in the gene set. GSEA uses the description field to determine what hyperlink to provide in the report for the gene set description: if the description is “na”, GSEA provides a link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.
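Because each GMT row is a complete gene set, the format can be read one line at a time. The sketch below is a hypothetical Python illustration, not GSEA code. The helper description_link mirrors the hyperlink rule described above; treating a description as a URL when it starts with http:// or https:// is an assumption, and the behavior for any other description is not specified on this page, so the helper returns None.

```python
import io


def read_gmt(handle):
    """Parse a GMT stream into {set_name: (description, genes)}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # assumption: skip blank or malformed lines
        name, description = fields[0], fields[1]
        genes = [gene for gene in fields[2:] if gene]
        gene_sets[name] = (description, genes)
    return gene_sets


def description_link(description):
    """Classify a description field according to the hyperlink rule."""
    if description == "na":
        return "msigdb"  # link to the named gene set in MSigDB
    if description.startswith(("http://", "https://")):
        return "url"  # link to the URL given in the description
    return None  # not covered by the rule described on this page


# Two made-up gene sets.
example = (
    "SET_A\tna\tTP53\tMYC\n"
    "SET_B\thttp://example.org/set_b\tBRCA1\n"
)
gene_sets = read_gmt(io.StringIO(example))
for name, (description, genes) in gene_sets.items():
    print(name, description_link(description), genes)
```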

GRP: Gene set file format (*.grp)

The GRP files contain a single gene set in a simple newline-delimited text format. Typically, you use the GMT or GMX file formats to create gene sets, rather than using the GRP file format. The GRP file format is organized as follows:

Grp format snapshot.gif
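A newline-delimited gene set needs very little code to read. The following Python sketch uses the hypothetical name read_grp. Skipping lines that begin with a pound sign (#) as comments is an assumption not stated in the text above.

```python
import io


def read_grp(handle):
    """Read a GRP stream into a list of gene names, one per line."""
    genes = []
    for line in handle:
        entry = line.strip()
        # Assumption: blank lines and lines starting with # carry no genes.
        if not entry or entry.startswith("#"):
            continue
        genes.append(entry)
    return genes


example = "# made-up gene set\nTP53\n\nMYC\nBRCA1\n"
genes = read_grp(io.StringIO(example))
print(genes)
```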

MDB: Molecular signature database file format (*.mdb)

The MDB files contain an entire gene set database. Unlike the GMT and GMX files, the MDB files are designed to contain rich annotation about a gene set. They are XML-formatted files based on the MSigDB Document Type Definition (DTD). Following is the MSigDB DTD and a sample MDB file based on that DTD.

MSigDB DTD:

Msigdb dtd snapshot.gif

Example of an MSigDB XML-formatted file:

Msigdb xml snapshot.gif

Microarray Chip Annotation Formats

CHIP: Chip file format (*.chip)

The CHIP file contains annotation about a microarray. It should list the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results and may also be used to collapse each probe set in the expression dataset to a single gene vector.

Chip format snapshot.gif

CSV: Comma Separated Version (*.csv)

The CSV file format is identical to the CHIP file, except that the values in each row are separated by commas rather than by tabs. This file format is primarily used for Affymetrix chips.

Ranked Gene Lists

RNK: Ranked list file format (*.rnk)

The RNK file contains a single, rank-ordered gene list (not a gene set) in a simple newline-delimited text format. It is used when you have a pre-ordered ranked list that you want to analyze with GSEA. For instance, you might have used your favorite t-test-like statistic to produce a rank-ordered gene list from your dataset, which you now want to test for enrichment.

Rnk format snapshot.gif
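As a rough illustration, a ranked list can be read as follows. This Python sketch is hypothetical (read_rnk and is_rank_ordered are not GSEA functions). It assumes that each line holds a gene name and a numeric score separated by a tab, and that lines beginning with a pound sign (#) are comments. Neither detail is stated in the text above.

```python
import io


def read_rnk(handle):
    """Read an RNK stream into a list of (gene, score) in file order."""
    ranked = []
    for line in handle:
        entry = line.strip()
        # Assumption: blank lines and lines starting with # are skipped.
        if not entry or entry.startswith("#"):
            continue
        gene, score = entry.split("\t")[:2]
        ranked.append((gene, float(score)))
    return ranked


def is_rank_ordered(ranked):
    """Return True if the scores never increase from one line to the next."""
    scores = [score for _, score in ranked]
    return all(a >= b for a, b in zip(scores, scores[1:]))


# Made-up scores, for example from a t-test-like statistic.
example = "TP53\t3.2\nMYC\t1.1\nBRCA1\t-0.7\n"
ranked = read_rnk(io.StringIO(example))
print(ranked)
print(is_rank_ordered(ranked))
```

The reader preserves the order found in the file rather than sorting, since the list is expected to be pre-ordered; the second helper only checks that expectation.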