ExpressionFileCreator (v13) BETA

This module is currently in beta release. The module and/or documentation may be incomplete.

Creates a RES or GCT file from a set of Affymetrix CEL files. For IVT arrays only; use AffySTExpressionFileCreator for ST arrays.

Author: Joshua Gould, David Eby, Broad Institute

Contact:

gp-help@broadinstitute.org

Algorithm Version:

Summary

The ExpressionFileCreator module creates a gene expression dataset from a ZIP archive containing individual Affymetrix CEL files. The conversion is done using one of the following algorithms: MAS5, RMA, GCRMA, or dChip.
The result is a matrix containing one intensity value per probe set, in the GCT or RES file format.
Samples can be annotated by specifying a CLM file. By default, sample names are taken from the CEL file names contained in the ZIP file. A CLM file lets you specify the sample names explicitly, reorder the columns of the expression matrix to match the order in which the scan names appear in the CLM file, select a subset of the scans in the input ZIP file, and create a class label file in the CLS format. For example, suppose the input ZIP file contains the files scan1.cel, scan2.cel, and scan3.cel. The CLM file could contain the following text:
scan3     sample3    tumor
scan1     sample1    tumor
scan2     sample2    normal
The column names in the expression matrix would be: sample3, sample1, sample2. Additionally, only scan names in the CLM file will be used to construct the GCT or RES file; scans not present in the CLM file will be ignored.
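
The CLM behavior described above (renaming, reordering, subsetting, and class labels) can be sketched in a few lines of Python. This is an illustration only, not the module's code: the function names are hypothetical, the scan name is assumed to be the CEL file name without its extension (as in the example above), and the CLS layout shown (a counts line, a class-name line, and a line of zero-based class indices) is the commonly used categorical CLS layout rather than something stated in this document.

```python
def parse_clm(text):
    """Parse tab-delimited CLM text into (scan, sample, class) tuples, in file order."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = [p.strip() for p in line.split("\t")]
        if len(parts) != 3:
            raise ValueError(f"expected scan, sample, class: {line!r}")
        rows.append(tuple(parts))
    return rows


def select_columns(cel_names, clm_rows):
    """Return (sample_names, cel_files) in CLM order; scans absent from the CLM are ignored."""
    # Assumption: a scan name is the CEL file name minus its extension.
    by_scan = {name.rsplit(".", 1)[0]: name for name in cel_names}
    samples, files = [], []
    for scan, sample, _cls in clm_rows:
        if scan not in by_scan:
            raise KeyError(f"scan {scan!r} not found in ZIP contents")
        samples.append(sample)
        files.append(by_scan[scan])
    return samples, files


def cls_text(clm_rows):
    """Build categorical CLS text; classes are numbered in order of first appearance."""
    classes = []
    for _scan, _sample, cls in clm_rows:
        if cls not in classes:
            classes.append(cls)
    labels = [str(classes.index(cls)) for _scan, _sample, cls in clm_rows]
    return "\n".join([
        f"{len(clm_rows)} {len(classes)} 1",
        "# " + " ".join(classes),
        " ".join(labels),
    ]) + "\n"


clm = "scan3\tsample3\ttumor\nscan1\tsample1\ttumor\nscan2\tsample2\tnormal\n"
rows = parse_clm(clm)
samples, files = select_columns(["scan1.cel", "scan2.cel", "scan3.cel"], rows)
print(samples)  # ['sample3', 'sample1', 'sample2']
```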
 
Note: A number of newer Affymetrix array types are not currently supported by ExpressionFileCreator, including the 1.1, 2.0, and 2.1 ST arrays, Exon arrays, and HTA 2.0 arrays. This is the case even if a CDF file is provided. Please use the AffySTExpressionFileCreator module for these arrays instead.

References

Affymetrix. Affymetrix Microarray Suite User Guide, version 5. Santa Clara, CA:Affymetrix, 2001.

Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249-264.
 
Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA. 2001;98:31-36.
 
Li C, Wong WH. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology. 2001;2:research0032.1-research0032.11.

Parameters

Name Description
input file * A zip file of CEL files
method * The method to use. Note that dChip and MAS5 will not work with ST arrays.
quantile normalization  (GCRMA and RMA only) Whether to normalize data using quantile normalization
background correct  (RMA only) Whether to background correct using RMA background correction
compute present absent calls  Whether to compute Present/Absent calls
normalization method  (MAS5 only) The normalization method to apply after expression values are computed. The column having the median of the means is used as the reference unless the "value to scale to" parameter is given.
value to scale to  (median/mean scaling only) The value to scale to.
clm file  A tab-delimited text file containing one scan, sample, and class per line
annotate probes * Whether to annotate probes with the gene symbol and description
cdf file  Custom CDF file. Leave blank to use the default, internally provided CDF file. (A custom CDF file is not implemented for GCRMA.)
output file * The base name of the output file(s)

* - required
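
The MAS5 "normalization method" and "value to scale to" parameters above can be illustrated with a small sketch of mean scaling. This is an assumption-laden illustration, not the module's implementation: the function name is hypothetical, plain (untrimmed) column means are assumed, and for an even number of columns the lower of the two middle means is taken as the reference. The module's actual arithmetic may differ.

```python
from statistics import mean


def mean_scale(matrix, value_to_scale_to=None):
    """Scale each column (sample) so that its mean equals a common target.

    matrix: list of rows (probe sets), each a list of per-sample values.
    If value_to_scale_to is None, the target is the mean of the reference
    column, i.e. the column whose mean is the median of all column means.
    """
    ncols = len(matrix[0])
    col_means = [mean(row[j] for row in matrix) for j in range(ncols)]
    if value_to_scale_to is None:
        # Assumption: with an even column count, use the lower middle mean.
        target = sorted(col_means)[(ncols - 1) // 2]
    else:
        target = value_to_scale_to
    factors = [target / m for m in col_means]
    return [[v * f for v, f in zip(row, factors)] for row in matrix]


data = [[1.0, 2.0, 4.0],
        [3.0, 6.0, 12.0]]
print(mean_scale(data))         # every column now has mean 4.0
print(mean_scale(data, 100.0))  # every column now has mean 100.0
```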

Input Files

  1. input.file
    A ZIP bundle containing the CEL files to be analyzed. Note that this ZIP must be flat, containing no subdirectories, and should contain no files other than CEL files (not even 'dot' files on UNIX/Mac). Special characters (especially spaces) in the file names of both the ZIP bundle and its contents may cause problems. We recommend replacing these with underscores. See this FAQ entry for more information.
  2. clm.file
    An optional CLM file to describe samples (name and phenotype class) and their mapping to CEL files.
  3. cdf.file
    An alternate CDF file to use for the analysis.  This is optional.
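
Before submitting a job, the constraints on input.file described above can be checked locally. The sketch below uses Python's standard zipfile module; the function name is hypothetical, and the set of "safe" characters (letters, digits, dot, underscore, hyphen) is an assumption, since the documentation only singles out spaces.

```python
import re
import zipfile


def check_cel_zip(zip_source):
    """Return a list of problems found in a CEL ZIP bundle (empty if none)."""
    problems = []
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            name = info.filename
            if info.is_dir() or "/" in name:
                problems.append(f"not flat: {name}")
                continue
            if name.startswith("."):
                problems.append(f"dot file: {name}")
            elif not name.lower().endswith(".cel"):
                problems.append(f"not a CEL file: {name}")
            # Assumption: anything outside this character set counts as "special".
            if not re.fullmatch(r"[A-Za-z0-9._-]+", name):
                problems.append(f"special characters: {name}")
    return problems
```

A call such as `check_cel_zip("my_scans.zip")` (a hypothetical file name) returns an empty list when the bundle looks acceptable under these assumptions.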

Output Files

  1. GCT file (if present/absent calls are NOT computed) or RES file (if present/absent calls ARE computed)
  2. CLS file (if a CLM file is supplied)
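
For readers unfamiliar with the GCT output, the sketch below writes a tiny expression matrix in what is understood to be the GCT 1.2 layout: a version line, a line with the row and column counts, a header line, and one tab-delimited row per probe set. The layout is not described in this document, so treat it as an assumption; the function name and the sample data are hypothetical.

```python
def gct_text(sample_names, rows):
    """Render rows of (probe_id, description, values) as GCT 1.2 text."""
    lines = [
        "#1.2",
        f"{len(rows)}\t{len(sample_names)}",
        "\t".join(["Name", "Description"] + list(sample_names)),
    ]
    for probe_id, description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{probe_id}: expected {len(sample_names)} values")
        lines.append("\t".join([probe_id, description] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


print(gct_text(["sample3", "sample1"], [("probe_1", "example gene", [10.5, 20.0])]))
```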

Requirements

ExpressionFileCreator requires R 2.15.3 with the following packages, each of which will automatically download and install when the module is installed:
boot_1.3-7 IRanges_1.16.2 spatial_7.3-5
class_7.3-5 Biobase_2.18.0 BiocGenerics_0.4.0
cluster_1.14.3 AnnotationDbi_1.20.1 affyio_1.26.0
foreign_0.8-51 zlibbioc_1.4.0 preprocessCore_1.20.0
KernSmooth_2.23-8 Matrix_1.0-9 affy_1.36.0
lattice_0.20-10 mgcv_1.7-21 Biostrings_2.26.2
MASS_7.3-22 nlme_3.1-105 gcrma_2.30.0
DBI_0.2-5 nnet_7.3-5 makecdfenv_1.36.0
RSQLite_0.11.2 rpart_3.1-55  
 
Please install R 2.15.3 rather than R 2.15.2 before installing the module. R 2.15.3 fixes significant bugs in R 2.15.2. The GenePattern team has confirmed that this module reproduces its R 2.15.2 test results under R 2.15.3, and can provide only limited support for other versions. R 2.15.3 must be installed and configured independently, as discussed in Using Different Versions of R and Using the R Installer Plug-in. Those sections also describe patch-level fixes that are necessary when additional installations of R are made, and considerations for those who use R outside of GenePattern.

Notes

  • The MAS5 and dChip algorithms are based on their Bioconductor implementations. Therefore, the results obtained from these algorithms will differ slightly from those of their official implementations.
  • The GCRMA and RMA algorithms produce log2-scale values, but ExpressionFileCreator removes the log2 transformation before generating the result file.
  • ST 1.1+ and ST exon arrays are not currently supported.  Please use AffySTExpressionFileCreator instead.
  • The underlying Affymetrix R package used by ExpressionFileCreator v12 fixes a bug in the dChip algorithm implementation.  Unfortunately, this means that dChip expression files created with previous versions are not directly comparable with newly created dChip files.  It is our strong recommendation that you discard older dChip results and re-create the expression files with the new version.
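
The note above about RMA and GCRMA output means that values in the result file are on the linear scale. A minimal sketch of that inverse transformation, with a hypothetical function name, is simply raising 2 to each value:

```python
def remove_log2(values):
    """Convert log2-scale expression values back to the linear scale."""
    return [2.0 ** v for v in values]


print(remove_log2([0.0, 1.0, 10.0]))  # [1.0, 2.0, 1024.0]
```

If downstream analysis expects log2 values, the transformation has to be reapplied to the result file.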

Arrays Supported:

For a list of arrays supported by R2.15, please see http://bioconductor.org/packages/2.10/data/annotation/
Alternatively, you can provide a CDF file with your job to process other array types.

Common Errors

Check the GenePattern FAQ regarding errors you may encounter:  http://www.broadinstitute.org/cancer/software/genepattern/doc/faq

 

Platform Dependencies

Task Type:
Preprocess & Utilities

CPU Type:
any

Operating System:
any

Language:
R 2.15

Version Comments

Version Release Date Description
12.3 2016-02-02 Updated to make use of the R package installer. Fixes a bug in failing to remove downloaded annotation files.
12 2013-10-31 Updated to R 2.15
11 2013-02-14 Updated to include Affy Annotation CSVs from Feb 2012
10 2012-04-06 Updated to use new CSV, removed tiger versions, renamed leopard version, removed extraneous R scripts, edited to point to correct packages, updated Affyio package to one that's built for R2.8
9 2012-01-26 Fixed memory corruption bug when reading some CDF files and with annotating probes when some annotations are missing
8 2008-10-29 Read latest Affymetrix CEL file format
6 2008-09-10 Added option to provide custom CDF file
5 2008-02-19 Added option to provide custom CDF file and updated for R 2.5.0
4 2006-11-13 Fixes scaling bug
3 2006-07-20 Added gcRMA and dChip algorithms
2 2006-06-19 Added gcRMA and dChip algorithms
1 2005-09-16