Cufflinks.cuffdiff (v6)

Finds significant changes in transcript expression, splicing, and promoter use

Author: Cole Trapnell et al, University of Maryland Center for Bioinformatics and Computational Biology

Contact:

gp-help@broadinstitute.org

Algorithm Version: Cufflinks 2.0.2

Summary

Cuffdiff finds significant changes in transcript expression, splicing, and promoter use.  You can use it to find differentially expressed genes and transcripts, as well as genes that are being differentially regulated at the transcriptional and post-transcriptional level. 

To identify a gene or transcript as differentially expressed, Cuffdiff tests the observed log fold change in its expression against the null hypothesis of no change (i.e., that the true log fold change = zero).  Because measurement error, technical variability, and cross-replicate biological variability might result in an observed log fold change that is nonzero even if the gene/transcript is not differentially expressed, Cuffdiff also assesses the significance of each comparison.  For more information on the model, see Trapnell et al (2013) or see the "How It Works" page on the Cufflinks site.
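As a rough illustration of the quantities involved, the sketch below computes a log2 fold change and applies a Benjamini-Hochberg adjustment to a list of p-values. This is not Cuffdiff's statistical model (see Trapnell et al, 2013, for that). The function names are hypothetical, and the choice of the Benjamini-Hochberg procedure is an assumption made here to show how an FDR threshold can be applied across many tests.

```python
import math


def log2_fold_change(value_1, value_2):
    """Log2 ratio of expression in condition 2 over condition 1."""
    return math.log2(value_2 / value_1)


def benjamini_hochberg(p_values):
    """Return FDR-adjusted p-values (q-values) in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


p_values = [0.001, 0.04, 0.03, 0.5]
q_values = benjamini_hochberg(p_values)
# A test is called significant when its q-value is within the allowed FDR.
significant = [q <= 0.05 for q in q_values]
```

Note that in this toy example two tests with raw p-values below 0.05 are no longer significant after adjustment, which is the effect that multiple-testing correction is meant to have.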

Cuffdiff also groups transcripts into biologically meaningful groups, such as transcripts that share the same transcription start site (TSS), in order to identify genes that are differentially regulated at the transcriptional or post-transcriptional level.

Cuffdiff was created at the University of Maryland Center for Bioinformatics and Computational Biology. This document is adapted from the Cufflinks documentation for release 2.0.2.

Usage

The Cuffdiff module takes two or more fragment alignment SAM/BAM files (from TopHat or other read aligner), as well as a GTF file containing transcript annotations (such as merged.gtf from Cuffmerge) as input.

Cuffdiff produces a number of output files that contain test results for changes in expression at the level of transcripts, primary transcripts, and genes. It also tracks changes in the relative abundance of transcripts sharing a common transcription start site, and in the relative abundances of the primary transcripts of each gene. Tracking the former shows changes in splicing, and the latter shows changes in relative promoter use within a gene.

Provide your aligned files organized by condition.  Use the Add Another Condition button if you have multiple conditions.  For each condition you can specify multiple replicates by dragging in the associated BAM files.  The condition labels are required and each one should be unique in order for the module to associate them with the replicates.  These will be used as the labels in the cuffdiff result files.

Cuffdiff has several methods for normalizing library sizes (i.e. sequencing depths).

  • geometric: FPKMs and fragment counts are scaled via the median of the geometric means of fragment counts across all libraries, as described in Anders and Huber (Genome Biology, 2010). This policy is identical to the one used by DESeq. This is the default.
  • quartile: FPKMs and fragment counts are scaled via the ratio of the 75th-percentile (upper-quartile) fragment count in each library to the average 75th-percentile value across all libraries.  This can improve robustness of differential expression calls for less abundant genes and transcripts.
  • classic-fpkm: Library size factor is set to 1 - no scaling applied to FPKM values or fragment counts.  This is the method used by Cufflinks.
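The three methods can be sketched as per-library scaling factors. The code below is a simplified illustration only: the function names are hypothetical, and details such as skipping genes with a zero count and the exact percentile rule are assumptions of this sketch, not a description of Cuffdiff's implementation.

```python
import math
from statistics import median


def geometric_factors(counts):
    """counts[gene][library] -> one factor per library (median of ratios)."""
    n_libs = len(counts[0])
    ratios = [[] for _ in range(n_libs)]
    for row in counts:
        if min(row) <= 0:
            continue  # assumption: genes with a zero count are skipped
        geo_mean = math.exp(sum(math.log(c) for c in row) / n_libs)
        for j, c in enumerate(row):
            ratios[j].append(c / geo_mean)
    return [median(r) for r in ratios]


def upper_quartile_factors(counts):
    """Ratio of each library's 75th-percentile count to the average one."""
    n_libs = len(counts[0])
    q75 = []
    for j in range(n_libs):
        column = sorted(row[j] for row in counts if row[j] > 0)
        q75.append(column[int(0.75 * (len(column) - 1))])  # simple percentile
    mean_q75 = sum(q75) / n_libs
    return [q / mean_q75 for q in q75]


def classic_fpkm_factors(counts):
    """No scaling: every library gets a factor of 1."""
    return [1.0] * len(counts[0])


# Toy data: library 2 was sequenced twice as deeply as library 1.
counts = [[10, 20], [100, 200], [50, 100], [8, 16]]
geo = geometric_factors(counts)
uq = upper_quartile_factors(counts)
```

In this toy data set both the geometric and the quartile factors recover the twofold difference in depth between the two libraries.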

Regardless of the choice of normalization.method, Cuffdiff reports expression estimates both in units of FPKM (in fpkm_tracking files) and as read counts (in count_tracking files).  When scaling to units of FPKM, Cuffdiff requires a count of an RNA-Seq sample library’s “total mapped reads”, i.e., the FPKM denominator.  Setting FPKM.scaling to 'compatible-hits' instructs Cuffdiff to include in its count of total mapped reads only those fragments compatible with the transcripts identified in the provided annotation (GTF file).  Setting FPKM.scaling to 'total-hits' instructs Cuffdiff to include all of a sample library’s mapped reads in its count of total mapped reads, including those not compatible with any of the transcripts identified in the provided GTF file.
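The effect of the two FPKM.scaling settings can be seen with the textbook FPKM formula. This is a minimal sketch with made-up numbers; the function name is hypothetical, and it uses the plain transcript length, whereas Cuffdiff's own calculation involves further corrections not shown here.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical library: 10 million mapped fragments, of which 8 million
# are compatible with transcripts in the supplied GTF file.
compatible_hits = 8_000_000
total_hits = 10_000_000

# The same transcript (2,000 bp, 400 fragments) under each setting.
fpkm_compatible = fpkm(400, 2000, compatible_hits)
fpkm_total = fpkm(400, 2000, total_hits)
```

Because 'total-hits' uses the larger denominator, it yields the lower FPKM for the same transcript (20 versus 25 in this example).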

Individual reads will sometimes be mapped to multiple positions in the genome due to sequence repeats and homology.  This complicates the weighting of reads to transcripts when calculating transcript- and gene-level abundances.  A simplistic weighting algorithm divides each multi-mapped read uniformly across all the positions it maps to.  Cuffdiff also supports a more sophisticated two-stage algorithm that first calculates initial abundance estimates for all transcripts using the uniform weighting scheme, and then re-estimates abundances using a weighting scheme that incorporates the initial abundance estimates, the inferred fragment length, and the fragment bias (if bias correction is enabled).  This two-stage scheme improves abundance estimates in cases where reads map to multiple positions in the genome.  By default, the module sets the multi.read.correct parameter to yes, instructing Cuffdiff to use the two-stage algorithm.  Note that this default differs from the Cuffdiff tool's own default, which leaves multi-read correction off; the module's default is more consistent with the expected usage in GenePattern.
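The two stages can be sketched on a toy example. This is a deliberately simplified illustration: it treats each read as mapping to one or more transcripts, and the second stage uses only the initial abundances, leaving out the fragment length and bias terms described above. All function names are hypothetical.

```python
def uniform_weights(read_hits):
    """Stage 1: split each read evenly across the transcripts it maps to."""
    return {read: {t: 1.0 / len(ts) for t in ts}
            for read, ts in read_hits.items()}


def abundances(weights, transcripts):
    """Sum the per-read weights assigned to each transcript."""
    totals = {t: 0.0 for t in transcripts}
    for per_read in weights.values():
        for t, w in per_read.items():
            totals[t] += w
    return totals


def rescue_weights(read_hits, initial):
    """Stage 2: re-split each read in proportion to the initial abundances."""
    weights = {}
    for read, ts in read_hits.items():
        denom = sum(initial[t] for t in ts)
        weights[read] = {t: initial[t] / denom for t in ts}
    return weights


# Three reads map uniquely to transcript A; one read maps to both A and B.
read_hits = {"r1": ["A"], "r2": ["A"], "r3": ["A"], "r4": ["A", "B"]}
initial = abundances(uniform_weights(read_hits), ["A", "B"])
final = abundances(rescue_weights(read_hits, initial), ["A", "B"])
```

After the first stage A holds 3.5 reads and B holds 0.5; the second stage moves most of the shared read to A, the transcript with more supporting evidence.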

The Cuffdiff tool provides a number of additional options and switches that are not directly available through this module's parameters.  The additional.cuffdiff.options parameter is provided to pass these through if you feel that you need them.  To use it, specify the extra option(s) along with any arguments in the input text field, separated by spaces.  At this time, this parameter does not easily support options that require a file argument.  Check the Cufflinks manual for more details of the available options.  Also note that there may be additional undocumented options; manually running the cuffdiff executable at the command line with no arguments may show even more options.  If you feel that a particular missing option would be of broad general interest, please contact the GenePattern team and we will look into adding it.  This parameter is recommended for expert use only; use it at your own discretion.  The GenePattern team does not explicitly test all of the possible options that may be passed through using this parameter and can only provide limited support.

For more information on using RNA-seq modules in GenePattern, see the RNA-seq Analysis page.

Important Notes:

Cuffdiff jobs can be very resource intensive.  If your job does not complete within a day, retry it on a server with more available memory, or, if you are running on the GenePattern public server, see this FAQ.

There are known issues that prevent Cuffdiff from running on the Mac Mini and possibly other Mac hardware.

Preparing to Run Cuffdiff

If you want Cuffdiff to look for changes in primary transcript expression, splicing, coding output, and promoter use, the input GTF transcript file needs to be annotated with certain attributes. These attributes are:

  • tss_id: The ID of a transcript's inferred start site; this determines which primary transcript this processed transcript is believed to come from.
  • p_id: The ID of the coding sequence this transcript contains.

The GTFs hosted on the GenePattern FTP site contain these annotations.
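Before running the module it can be useful to confirm that a GTF file actually carries these attributes. The following is a minimal sketch that counts exon records lacking tss_id or p_id; the function names are hypothetical, and the attribute parsing assumes the usual key "value"; layout of the ninth GTF column.

```python
import re

# Matches attribute pairs such as: gene_id "G1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))


def missing_attribute_counts(gtf_lines, required=("tss_id", "p_id")):
    """Return (exon record count, {attribute: number of exons lacking it})."""
    missing = {name: 0 for name in required}
    total = 0
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        total += 1
        attrs = parse_attributes(cols[8])
        for name in required:
            if name not in attrs:
                missing[name] += 1
    return total, missing


example = [
    'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; tss_id "TSS1"; p_id "P1";',
    'chr1\tCufflinks\texon\t300\t400\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T2"; tss_id "TSS1";',
]
total, missing = missing_attribute_counts(example)
```

For a real file, pass an open file handle instead of the example list, for instance missing_attribute_counts(open("merged.gtf")).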

Important note on Cuffdiff results

This module may produce some empty files. This does not mean that the algorithm has failed.  Empty files may result when Cuffdiff detects no differentially expressed transcripts.

It may also be the result of using an input GTF transcript file that does not have the p_id annotation.  This attribute is attached to Cuffmerge output only when it is run with a reference annotation that includes coding sequence (CDS) records. Differential CDS analysis in Cuffdiff is only performed when all isoforms of a gene have p_id attributes.  The CDS output files will be empty if there is no p_id attribute in the input GTF.

References

Trapnell C, Hendrickson D, Sauvageau S, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology. 2013;31:46-53.

Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols. 2012;7:562-578.

Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011 Sep 1;27(17):2325-9.

Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.  Nat Biotechnol. 2010;28:511-515.

Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105-1111.

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.

Links

Cufflinks website.
Cufflinks manual.  Note that this information may be based on a subsequent version of Cufflinks.
TopHat website.

Parameters

Name Description
aligned files * A set of aligned files grouped by condition.
GTF file * A transcript annotation file in GTF/GFF format produced by cufflinks, cuffcompare, cuffmerge, or other source (such as a reference annotation GTF).  See the Input Files section for more information, particularly on required attributes.
time series  Analyze the provided samples as a time series, rather than testing for differences between all pairs of samples. Default: no
normalization method * The normalization method to be used.  Choices are geometric mean (the default), upper quartile, or raw mapped count (classic FPKM) normalization.  See the usage section for a discussion of these methods. Default: geometric
FPKM scaling * Controls how Cuffdiff counts mapped fragments toward the number used in the FPKM denominator.  Use compatible-hits (the default) to count only those fragments compatible with some reference transcript.  Use total-hits to count all fragments, even those not compatible with a reference transcript.
frag bias correct  A genome reference multi-FASTA file for the bias detection and correction algorithm.  For more information on this algorithm, see "How It Works" on the Cufflinks website.
multi read correct * Instructs Cuffdiff to use its two-stage read weighting algorithm to more accurately distribute a multi-mapped read's count contribution across the multiple loci the read mapped to. Default: yes.  Note that this default differs from the Cuffdiff tool's default.  See the Usage section above for details.
min alignment count  The minimum number of alignments in a locus needed to conduct significance testing on changes in that locus observed between samples.  If no testing is performed, changes in the locus are deemed not significant, and the locus's observed changes do not contribute to correction for multiple testing.  Default: 10
FDR  The allowed false discovery rate.  The Cuffdiff tool's default is 0.05.
mask file  This file tells Cuffdiff to ignore all reads that could have come from transcripts in this GTF/GFF file. It is recommended that annotated rRNA, mitochondrial transcripts, and other abundant transcripts you want to ignore in your analysis be included in this file. 
library type * The library type used to generate reads. The choices are inferred, fr-unstranded, fr-firststrand, fr-secondstrand, ff-unstranded, ff-firststrand, ff-secondstrand, and transfrags.  The default is inferred, meaning that no library type information is passed.
skip diff exp * Tells Cuffdiff to perform quantification and normalization only and to skip its differential expression calculations, which are computationally expensive.  The default is to compute differential expression.
additional cuffdiff options Additional options to be passed along to the Cuffdiff program at the command line. This parameter gives you a means to specify otherwise unavailable Cuffdiff options and switches not supported by the module; check the Cufflinks manual for details.  Note that the information at this link may refer to a subsequent version of Cufflinks.  Recommended for experts only; use this at your own discretion.

* - required
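The difference between the default mode and the time series setting comes down to which pairs of conditions are compared. The sketch below illustrates this under the assumption that time series mode compares each condition only with the one that follows it, in the order provided; the function name is hypothetical.

```python
from itertools import combinations


def comparison_pairs(labels, time_series=False):
    """List the condition pairs that would be tested against each other."""
    if time_series:
        # Assumption: only consecutive time points are compared.
        return list(zip(labels, labels[1:]))
    # Default: every pair of conditions is compared.
    return list(combinations(labels, 2))


labels = ["t0", "t1", "t2", "t3"]
all_pairs = comparison_pairs(labels)
series_pairs = comparison_pairs(labels, time_series=True)
```

With four conditions the default mode yields six comparisons, while the time series mode yields three, so the order in which conditions are supplied matters when time series is set to yes.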

Cuffdiff pass-through options

The following may be useful for advanced users who wish to use the additional.cuffdiff.options parameter.  This is the 'usage' output from running cuffdiff at the command-line, which gives a list of all of the available options and switches.  Note that this was generated by Cuffdiff v2.0.2 and that the options here may differ from the documentation provided online at the Cufflinks website due to subsequent version updates.

cuffdiff v2.0.2 (3524M)
-----------------------------
Usage:   cuffdiff [options] <transcripts.gtf> <sample1_hits.sam> <sample2_hits.sam> [... sampleN_hits.sam]
   Supply replicate SAMs as comma separated lists for each condition: sample1_rep1.sam,sample1_rep2.sam,...sample1_repM.sam
General Options:
  -o/--output-dir              write all output files to this directory              [ default:     ./ ]
  --seed                       value of random number generator seed                 [ default:      0 ]
  -T/--time-series             treat samples as a time-series                        [ default:  FALSE ]
  -c/--min-alignment-count     minimum number of alignments in a locus for testing   [ default:   10 ]
  --FDR                        False discovery rate used in testing                  [ default:   0.05 ]
  -M/--mask-file               ignore all alignment within transcripts in this file  [ default:   NULL ]
  -b/--frag-bias-correct       use bias correction - reference fasta required        [ default:   NULL ]
  -u/--multi-read-correct      use 'rescue method' for multi-reads (more accurate)   [ default:  FALSE ]
  -N/--upper-quartile-norm     use upper-quartile normalization                      [ default:  FALSE ]
  --geometric-norm             use geometric mean normalization                      [ default:  TRUE ]
  --raw-mapped-norm            use raw mapped count normalized (classic FPKM)        [ default:  FALSE ]
  -L/--labels                  comma-separated list of condition labels
  -p/--num-threads             number of threads used during quantification          [ default:      1 ]
  --no-diff                    Don't generate differential analysis files            [ default:  FALSE ]

Advanced Options:
  --library-type               Library prep used for input reads                     [ default:  below ]
  -m/--frag-len-mean           average fragment length (unpaired reads only)         [ default:    200 ]
  -s/--frag-len-std-dev        fragment length std deviation (unpaired reads only)   [ default:     80 ]
  --num-importance-samples     number of importance samples for MAP restimation      [ DEPRECATED      ]
  --num-bootstrap-samples      Number of bootstrap replications                      [ DEPRECATED      ]
  --bootstrap-fraction         Fraction of fragments in each bootstrap sample        [ DEPRECATED      ]
  --max-mle-iterations         maximum iterations allowed for MLE calculation        [ default:   5000 ]
  --compatible-hits-norm       count hits compatible with reference RNAs only        [ default:   TRUE ]
  --total-hits-norm            count all hits for normalization                      [ default:  FALSE ]
  --poisson-dispersion         Don't fit fragment counts for overdispersion          [ default:  FALSE ]
  -v/--verbose                 log-friendly verbose processing (no progress bar)     [ default:  FALSE ]
  -q/--quiet                   log-friendly quiet processing (no progress bar)       [ default:  FALSE ]
  --no-update-check            do not contact server to check for update availability[ default:  FALSE ]
  --emit-count-tables          print count tables used to fit overdispersion         [ default:  FALSE ]
  --max-bundle-frags           maximum fragments allowed in a bundle before skipping [ default: 500000 ]
  --num-frag-count-draws       Number of fragment generation samples                 [ default:   1000 ]
  --num-frag-assign-draws      Number of fragment assignment samples per generation  [ default:      1 ]
  --max-frag-multihits         Maximum number of alignments allowed per fragment     [ default: unlim  ]
  --min-outlier-p              Min replicate p value to admit for testing            [ default:   0.01 ]
  --min-reps-for-js-test       Replicates needed for relative isoform shift testing  [ default:      3 ]
  --no-effective-length-correction   No effective length correction                  [ default:  FALSE ]
  --no-length-correction       No effective length correction                        [ default:  FALSE ]

Debugging use only:
  --read-skip-fraction         Skip a random subset of reads this size               [ default:    0.0 ]
  --no-read-pairs              Break all read pairs                                  [ default:  FALSE ]
  --trim-read-length           Trim reads to be this long (keep 5' end)              [ default:   none ]

Supported library types:
ff-firststrand
ff-secondstrand
ff-unstranded
fr-firststrand
fr-secondstrand
fr-unstranded (default)
transfrags
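To relate the usage output above to the module's condition-oriented inputs, the following sketch assembles a cuffdiff argument list from labeled conditions and their replicate files. It only builds the list and does not run anything. The function name, file names, and the order in which options are placed are hypothetical; the options themselves (-L, -u, --FDR) are taken from the usage output above.

```python
import shlex


def build_cuffdiff_command(gtf, conditions, extra_options=""):
    """conditions: ordered mapping of condition label -> replicate files."""
    cmd = ["cuffdiff", "-L", ",".join(conditions)]
    # Pass-through options are split on whitespace, honoring shell quoting.
    cmd += shlex.split(extra_options)
    cmd.append(gtf)
    # Replicates of one condition are joined into a comma-separated list.
    cmd += [",".join(replicates) for replicates in conditions.values()]
    return cmd


conditions = {
    "control": ["c1.bam", "c2.bam"],
    "treated": ["t1.bam", "t2.bam"],
}
cmd = build_cuffdiff_command("merged.gtf", conditions, "-u --FDR 0.05")
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run if cuffdiff is installed and on the PATH.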

 


 

Input Files

  1. EITHER two SAM/BAM files containing aligned RNA-seq reads (in the <first.SAM.or.BAM.file> and <second.SAM.or.BAM.file> parameters)
            OR
    a text file containing the absolute pathnames of more than two SAM/BAM files (in the <input.files.list> parameter)
    SAM/BAM files of aligned RNA-Seq reads. Cuffdiff tests for differential expression and regulation between all pairs of samples.  SAM is a standard short read alignment format that allows aligners to attach custom tags to individual alignments. BAM is the binary equivalent of SAM.  For more information about the SAM/BAM format, see the specification.
  2. <GTF.file>
    A transcript annotation file in GTF/GFF format.  Cuffdiff requires that transcripts in the input GTF be annotated with the tss_id and p_id attributes in order to look for changes in primary transcript expression, splicing, coding output, and promoter use.  See the Cuffdiff documentation for a discussion of these required attributes.  For more information on GTF format, see the specification.
    A common file used as input here is merged.gtf from Cuffmerge.  The GenePattern FTP site also hosts a number of reference annotation GTFs, available in a dropdown selection (requires GenePattern 3.7.0+).

  3. <sample.labels>
    A text file containing a label for each sample, one label per line.  These labels replace the default "q0, q1, ...qN" labeling for each sample in the tracking output files.   While this parameter is optional, using it may make downstream analysis of your samples easier.  Note: this field should not be used with GP 3.8.0+.

  4. <frag.bias.correct>
    A genome reference multi-FASTA file.  This reference genome file instructs Cuffdiff to run the bias detection and correction algorithm.  For more information on this algorithm, see "How It Works" on the Cufflinks website.  For more information on the FASTA format, see this description.
    The GenePattern FTP site hosts a number of reference genomes, available in a dropdown selection (requires GenePattern 3.7.0+).

  5. <mask.file>
    A GTF/GFF file that specifies transcripts to be ignored.

Output Files

For more information on the formats of the individual output files, see the Cufflinks Web site.

  1. FPKM_tracking files
    Cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and gene FPKMs are computed by summing the FPKMs of transcripts in each primary transcript group or gene group. For more information on the FPKM_tracking format, see the file format page.  There are four FPKM tracking files:
    • isoforms.fpkm_tracking: Transcript FPKMs

    • genes.fpkm_tracking: Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene ID.

    • cds.fpkm_tracking: Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each p_id (the ID of the coding sequence each transcript contains), independent of tss_id.

    • tss_groups.fpkm_tracking: Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each tss_id (transcription start site [TSS] ID).  The tss_id is the ID of the transcript's inferred start site and determines which primary transcript a processed transcript is believed to come from.

  2. Count tracking files
    Cuffdiff estimates the number of fragments that originated from each transcript, primary transcript, and gene in each sample. Primary transcript and gene counts are computed by summing the counts of transcripts in each primary transcript group or gene group. The results are output in the format described here.   There are four count tracking files:

    • isoforms.count_tracking: Transcript counts.

    • genes.count_tracking: Gene counts. Tracks the summed counts of transcripts sharing each gene ID.

    • cds.count_tracking: Coding sequence counts. Tracks the summed counts of transcripts sharing each p_id, independent of tss_id.

    • tss_groups.count_tracking: Primary transcript counts.  Tracks the summed counts of transcripts sharing each tss_id.

  3. Read group tracking files
    Cuffdiff calculates the expression and fragment count for each transcript, primary transcript, and gene in each replicate. The results are output in per-replicate tracking files in the format described here. There are four read group tracking files:

    • isoforms.read_group_tracking: Transcript read group tracking.

    • genes.read_group_tracking: Gene read group tracking. Tracks the summed expression and counts of transcripts sharing each gene ID in each replicate.

    • cds.read_group_tracking: Coding sequence read group tracking. Tracks the summed expression and counts of transcripts sharing each p_id, independent of tss_id, in each replicate.

    • tss_groups.read_group_tracking: Primary transcript read group tracking.  Tracks the summed expression and counts of transcripts sharing each tss_id in each replicate.

  4. Differential expression tests
    These tab-delimited files list the results of differential expression testing between samples for spliced transcripts, primary transcripts, genes, and coding sequences. For each pair of samples x and y, four files are created:

    • isoform_exp.diff: Transcript differential FPKM.

    • gene_exp.diff: Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id.

    • cds_exp.diff: Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id, independent of tss_id.

    • tss_group_exp.diff: Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id.

  5. Differential splicing tests: splicing.diff
    This tab-delimited file lists, for each primary transcript, the amount of overloading detected among its isoforms, i.e., how much differential splicing exists between isoforms processed from a single primary transcript. Only primary transcripts from which two or more isoforms are spliced are listed in this file.

  6. Differential coding output: cds.diff
    This tab-delimited file lists, for each gene, the amount of overloading detected among its coding sequences, i.e., how much differential CDS output exists between samples. Only genes producing two or more distinct CDS (i.e., multi-protein genes) are listed here.

  7. Differential promoter use: promoters.diff
    This tab-delimited file lists, for each gene, the amount of overloading detected among its primary transcripts, i.e., how much differential promoter use exists between samples. Only genes producing two or more distinct primary transcripts (i.e., multi-promoter genes) are listed here.

  8. Read group information: read_groups.info
    This tab-delimited file lists, for each replicate, key properties used by Cuffdiff during quantification, such as library normalization factors.

  9. Run information: run.info
    This tab-delimited file lists information about a Cuffdiff run to help track what options were provided.
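Since the differential expression test files are tab-delimited, they are straightforward to post-process. The sketch below keeps only the rows that were tested successfully and flagged as significant. The column names (status, significant, and the others in the sample) and the values "OK" and "yes" are assumptions based on the file formats described on the Cufflinks Web site and may differ between versions; the sample rows are invented and contain fewer columns than a real file.

```python
import csv
import io


def significant_rows(handle, status_col="status", flag_col="significant"):
    """Return rows that were tested (status OK) and flagged significant."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader
            if row[status_col] == "OK" and row[flag_col] == "yes"]


# Invented stand-in for gene_exp.diff with a reduced set of columns.
sample = io.StringIO(
    "test_id\tgene\tsample_1\tsample_2\tstatus\tq_value\tsignificant\n"
    "XLOC_1\tGENE_A\tcontrol\ttreated\tOK\t0.001\tyes\n"
    "XLOC_2\tGENE_B\tcontrol\ttreated\tOK\t0.640\tno\n"
    "XLOC_3\tGENE_C\tcontrol\ttreated\tNOTEST\t1.000\tno\n"
)
hits = significant_rows(sample)
for row in hits:
    print(row["gene"], row["q_value"])
```

For a real result file, replace the in-memory sample with an open file handle, for instance significant_rows(open("gene_exp.diff")).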

Platform Dependencies

Task Type:
RNA-seq

CPU Type:
x86_64

Operating System:
Mac, Linux

Language:
C++, Perl

Version Comments

Version Release Date Description
6 2014-04-02 Provides a new condition-oriented UI, adds a parameter to allow pass through of extra Cuffdiff options, adds a parameter to skip differential expression, clarifies the normalization options
5 2013-09-20 Added hosted GTF file selectors and HTML-based docs.
4 2013-05-07 Updated to Cufflinks version 2.0.2
3 2012-01-13 Updated to Cufflinks.cuffdiff version 1.3.0
2 2012-12-23 Updated to Cufflinks.cuffdiff version 1.2.1
1 2011-04-11