Cufflinks (v8)

Cufflinks 2.2.1 - Assembles transcripts, estimates abundances, and tests for differential expression and regulation in RNA-seq samples

Author: Cole Trapnell et al, University of Maryland Center for Bioinformatics and Computational Biology

Contact:

gp-help@broadinstitute.org

Algorithm Version: Cufflinks 2.2.1

Summary

Cufflinks assembles transcripts and estimates their abundances in RNA-seq samples. It accepts aligned RNA-seq reads and assembles the alignments into a parsimonious set of transcripts, reporting as few full-length transcript fragments (transfrags) as are needed to explain the data. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one.

Cufflinks was created at the University of Maryland Center for Bioinformatics and Computational Biology. This document is adapted from the Cufflinks documentation for release 2.2.1.

Usage

Cufflinks takes a file of alignments in SAM or BAM (the binary equivalent of SAM) format as input. For more details on the SAM/BAM format, see the Input Files section and/or the specification. The RNA-seq read mapper TopHat produces BAM output and is recommended for use with Cufflinks. However, Cufflinks will accept SAM/BAM alignments generated by any read mapper as long as they meet certain requirements; see the Input Files section for more details.

Optionally, a reference genome annotation file can be supplied through one of two parameters. If it is sent to the GTF parameter, Cufflinks uses the file to estimate isoform expression and does not assemble novel transcripts; the program ignores alignments that are not structurally compatible with any reference transcript. If it is sent to the GTF guide parameter instead, Cufflinks uses the reference annotation based transcript (RABT) assembly algorithm. In that mode the guide file is used to generate faux-reads that tile the reference transcripts so that every reference transcript position is covered by multiple reads, and the information in the faux-reads is merged with the data from the sequenced reads. For more information, see Roberts et al. (2011) or the "How It Works" page on the Cufflinks site.
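
The two annotation modes correspond to the -G/--GTF and -g/--GTF-guide options listed in the pass-through usage text later in this document. The following is a minimal Python sketch of how a command line for each mode might be put together. The function name build_cufflinks_command and the file names are hypothetical, and the sketch assumes that only one of the two annotation parameters is supplied per run; it does not show how the GenePattern module itself constructs its command.

```python
def build_cufflinks_command(alignments, gtf=None, gtf_guide=None, output_dir="./"):
    """Return an argument list for a cufflinks run (illustrative only).

    gtf       -> quantify against the reference only (-G), no novel assembly
    gtf_guide -> RABT assembly guided by the reference (-g)
    """
    if gtf and gtf_guide:
        # Assumption of this sketch: the two annotation modes are alternatives.
        raise ValueError("supply either gtf or gtf_guide, not both")

    cmd = ["cufflinks", "-o", output_dir]
    if gtf:
        cmd += ["-G", gtf]
    elif gtf_guide:
        cmd += ["-g", gtf_guide]
    cmd.append(alignments)  # the SAM/BAM alignment file comes last
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(build_cufflinks_command("hits.bam", gtf="genes.gtf")))
    print(" ".join(build_cufflinks_command("hits.bam", gtf_guide="genes.gtf")))
```

With neither annotation argument set, the sketch produces a plain assembly command with no reference annotation.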

The Cufflinks tool provides a number of additional options and switches that are not directly available through this module's parameters. The additional.cufflinks.options parameter is provided to pass these through if you need them. To use it, specify the extra option(s), along with any arguments, in the input text field, separated by spaces. At this time, this parameter does not easily support options that require a file argument. Check the Cufflinks manual for details of the available options. There may also be additional undocumented options; running the cufflinks executable at the command line with no arguments may show more. If you feel that a particular missing option would be of broad general interest, please contact the GenePattern team and we will look into adding it. This parameter is recommended for expert use only; use it at your own discretion. The GenePattern team does not explicitly test all of the possible options that may be passed through using this parameter and can only provide limited support.
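
As an illustration of the space-separated format, the sketch below splits a pass-through string into individual arguments and places them ahead of the alignment file in a command line. The helper name add_extra_options and the base command are hypothetical; the two options used (--max-bundle-frags and --no-update-check) are taken from the usage text later in this document. How the module actually merges the string into its command is not documented here, so treat this only as a picture of what the text field is expected to contain.

```python
import shlex


def add_extra_options(base_cmd, extra_options_text):
    """Insert space-separated pass-through options before the final
    positional argument (the alignment file) of base_cmd."""
    extra = shlex.split(extra_options_text)  # split on whitespace, honoring quotes
    return base_cmd[:-1] + extra + base_cmd[-1:]


if __name__ == "__main__":
    base = ["cufflinks", "-o", "./", "hits.bam"]  # hypothetical base command
    text = "--max-bundle-frags 1000000 --no-update-check"
    print(add_extra_options(base, text))
```

An empty text field leaves the base command unchanged.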

For more information on using RNA-seq modules in GenePattern, see the RNA-seq Analysis page.

Important Notes:

Cufflinks jobs can be very resource intensive.  If your job does not complete within a day, retry it on a server with more available memory, or, if you are running on the GenePattern public server, see this FAQ.

There are known issues that prevent Cufflinks from running on the Mac Mini and possibly other Mac hardware.

References

Trapnell C, Hendrickson D, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology. 2013;31:46-53.

Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols. 2012;7:562–578.

Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011 Sep 1;27(17):2325-9.

Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.  Nat Biotechnol. 2010;28:511-515.

Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105-1111.

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.

Links

Cufflinks website.
Cufflinks manual.  Note that this information may be based on a subsequent version of Cufflinks.
TopHat website.

Parameters

Name Description
input file * Input file in SAM or BAM format
transfrag label  A label for the transfrags in the output files
GTF  Reference annotation (GFF/GTF file) for isoform expression estimates
GTF guide  Reference annotation (GFF/GTF file) to guide RABT assembly
mask file  A GTF file specifying transcripts to be ignored
frag bias correct  Reference (FASTA/FA) for bias detection and correction algorithm
multi read correct  Whether to do an initial estimation procedure to more accurately weight reads mapping to multiple locations in the genome
library type  The library type used to generate reads.
min frags per transfrag  Assembled transfrags supported by fewer than this many aligned RNA-Seq fragments are not reported.
output prefix * The prefix for the output file.
additional cufflinks options  Additional options to be passed along to the Cufflinks program at the command line. This parameter gives you a means to specify otherwise unavailable Cufflinks options and switches not supported by the module; check the Cufflinks manual for details. Recommended for experts only; use this at your own discretion.

* - required

Cufflinks pass-through options

The following may be useful for advanced users who wish to use the additional.cufflinks.options parameter.  This is the 'usage' output from running cufflinks at the command-line, which gives a list of all of the available options and switches.  Note that this was generated by Cufflinks v2.2.1 and that the options here may differ from the documentation provided online at the Cufflinks website due to subsequent version updates.


cufflinks v2.2.1

-----------------------------

Usage:   cufflinks [options] 

General Options:

  -o/--output-dir              write all output files to this directory              [ default:     ./ ]

  -p/--num-threads             number of threads used during analysis                [ default:      1 ]

  --seed                       value of random number generator seed                 [ default:      0 ]

  -G/--GTF                     quantitate against reference transcript annotations                      

  -g/--GTF-guide               use reference transcript annotation to guide assembly                   

  -M/--mask-file               ignore all alignment within transcripts in this file                     

  -b/--frag-bias-correct       use bias correction - reference fasta required        [ default:   NULL ]

  -u/--multi-read-correct      use 'rescue method' for multi-reads (more accurate)   [ default:  FALSE ]

  --library-type               library prep used for input reads                     [ default:  below ]

  --library-norm-method        Method used to normalize library sizes                [ default:  below ]



Advanced Abundance Estimation Options:

  -m/--frag-len-mean           average fragment length (unpaired reads only)         [ default:    200 ]

  -s/--frag-len-std-dev        fragment length std deviation (unpaired reads only)   [ default:     80 ]

  --max-mle-iterations         maximum iterations allowed for MLE calculation        [ default:   5000 ]

  --compatible-hits-norm       count hits compatible with reference RNAs only        [ default:  FALSE ]

  --total-hits-norm            count all hits for normalization                      [ default:  TRUE  ]

  --num-frag-count-draws       Number of fragment generation samples                 [ default:    100 ]

  --num-frag-assign-draws      Number of fragment assignment samples per generation  [ default:     50 ]

  --max-frag-multihits         Maximum number of alignments allowed per fragment     [ default: unlim  ]

  --no-effective-length-correction   No effective length correction                  [ default:  FALSE ]

  --no-length-correction       No length correction                                  [ default:  FALSE ]

  -N/--upper-quartile-norm     Deprecated, use --library-norm-method                 [    DEPRECATED   ]

  --raw-mapped-norm            Deprecated, use --library-norm-method                 [    DEPRECATED   ]



Advanced Assembly Options:

  -L/--label                   assembled transcripts have this ID prefix             [ default:   CUFF ]

  -F/--min-isoform-fraction    suppress transcripts below this abundance level       [ default:   0.10 ]

  -j/--pre-mrna-fraction       suppress intra-intronic transcripts below this level  [ default:   0.15 ]

  -I/--max-intron-length       ignore alignments with gaps longer than this          [ default: 300000 ]

  -a/--junc-alpha              alpha for junction binomial test filter               [ default:  0.001 ]

  -A/--small-anchor-fraction   percent read overhang taken as 'suspiciously small'   [ default:   0.09 ]

  --min-frags-per-transfrag    minimum number of fragments needed for new transfrags [ default:     10 ]

  --overhang-tolerance         number of terminal exon bp to tolerate in introns     [ default:      8 ]

  --max-bundle-length          maximum genomic length allowed for a given bundle     [ default:3500000 ]

  --max-bundle-frags           maximum fragments allowed in a bundle before skipping [ default: 500000 ]

  --min-intron-length          minimum intron size allowed in genome                 [ default:     50 ]

  --trim-3-avgcov-thresh       minimum avg coverage required to attempt 3' trimming  [ default:     10 ]

  --trim-3-dropoff-frac        fraction of avg coverage below which to trim 3' end   [ default:    0.1 ]

  --max-multiread-fraction     maximum fraction of allowed multireads per transcript [ default:   0.75 ]

  --overlap-radius             maximum gap size to fill between transfrags (in bp)   [ default:     50 ]



Advanced Reference Annotation Guided Assembly Options:

  --no-faux-reads              disable tiling by faux reads                          [ default:  FALSE ]

  --3-overhang-tolerance       overhang allowed on 3' end when merging with reference[ default:    600 ]

  --intron-overhang-tolerance  overhang allowed inside reference intron when merging [ default:     30 ]



Advanced Program Behavior Options:

  -v/--verbose                 log-friendly verbose processing (no progress bar)     [ default:  FALSE ]

  -q/--quiet                   log-friendly quiet processing (no progress bar)       [ default:  FALSE ]

  --no-update-check            do not contact server to check for update availability[ default:  FALSE ]



Supported library types:

	ff-firststrand

	ff-secondstrand

	ff-unstranded

	fr-firststrand

	fr-secondstrand

	fr-unstranded (default)

	transfrags



Supported library normalization methods:

	classic-fpkm

Input Files

  1. <input.file> (required)
    File of RNA-seq read alignments in SAM (a tab-delimited format) or BAM (a compressed binary version of SAM) format.  SAM is a standard short read alignment format that allows aligners to attach custom tags to individual alignments.  This file is the output of a read mapping application, such as TopHat, and the alignment section contains information regarding the mapped location of each sequenced RNA-seq read on a reference genome.
    For more information on the SAM format, see the specification.

    Cufflinks will accept SAM alignments generated by any read mapper.  These must, however, use the custom 'XS' tag (SAM tag names are case-sensitive).  This attribute, which must have a value of "+" or "-", indicates which strand the RNA that produced this read came from. While this tag can be applied to any alignment, including unspliced ones, it must be present for all spliced alignment records (those with an 'N' operation in the CIGAR string).

    Also, the SAM file supplied to Cufflinks must be sorted by reference position. If you aligned your reads with TopHat, your alignments will be properly sorted already.  If not, this can be done with the SortSam module.
  2. <GTF> (optional)
    A tab-delimited reference annotation file in GTF format.  This file is used by Cufflinks to estimate abundances of isoforms. These reference annotation files can be downloaded for many genomes from sites like UCSC Genome Browser.  For more information on the GTF format, see the specification.
    The GenePattern FTP site hosts a number of reference annotation GTFs, available in a dropdown selection (requires GenePattern 3.7.0+).

  3. <GTF.guide> (optional)
    A tab-delimited reference annotation file in GTF format.  This file is used by Cufflinks to guide RABT assembly.
    The GenePattern FTP site hosts a number of reference annotation GTFs, available in a dropdown selection (requires GenePattern 3.7.0+).

  4. <mask.file> (optional)
    A tab-delimited GTF file that specifies transcripts to be ignored.

  5. <frag.bias.correct> (optional)
    Reference multi-FASTA file for bias detection and correction algorithm.   For more information on the FASTA format, see this description.
    The GenePattern FTP site hosts a number of reference genomes, available in a dropdown selection (requires GenePattern 3.7.0+).
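
The two requirements on the alignment file described under <input.file> (a strand tag on every spliced record, and sorting by reference position) can be checked before submitting a job. The following is a rough Python sketch that works on uncompressed SAM text. It assumes the standard SAM column layout (reference name in column 3, position in column 4, CIGAR in column 6, optional tags from column 12 onward) and the strand tag encoded as XS:A:+ or XS:A:-. The function name check_sam_for_cufflinks is hypothetical and the check is not part of Cufflinks or GenePattern.

```python
def check_sam_for_cufflinks(lines):
    """Return a list of human-readable problems found in SAM text lines."""
    problems = []
    finished_refs = set()   # references we have already moved past
    current_ref = None
    last_pos = 0

    for line_no, line in enumerate(lines, start=1):
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        rname, pos, cigar = fields[2], int(fields[3]), fields[5]
        if rname == "*":
            continue  # unmapped read, nothing to check

        # Sort check: each reference forms one block, positions never decrease.
        if rname != current_ref:
            if rname in finished_refs:
                problems.append(f"line {line_no}: reference {rname} appears in more than one block")
            if current_ref is not None:
                finished_refs.add(current_ref)
            current_ref, last_pos = rname, 0
        if pos < last_pos:
            problems.append(f"line {line_no}: position {pos} is out of order")
        last_pos = pos

        # Strand check: spliced records (CIGAR contains N) need XS:A:+ or XS:A:-.
        if "N" in cigar:
            strands = [tag[5:] for tag in fields[11:] if tag.startswith("XS:A:")]
            if not strands or strands[0] not in ("+", "-"):
                problems.append(f"line {line_no}: spliced alignment without a valid XS tag")
    return problems
```

An empty result means the sketch found no problems; it does not guarantee that Cufflinks will accept the file. A BAM file would first need to be converted to SAM text with a tool of your choice.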

Output Files

  1. transcripts.gtf
    This GTF file contains Cufflinks' assembled isoforms. The first 7 columns are standard GTF, and the last column contains attributes, some of which are also standardized ("gene_id" and "transcript_id"). There is one GTF record per row, and each record represents either a transcript or an exon within a transcript.
  2. genes.fpkm_tracking
    This is a tab-delimited file containing one row per gene; the columns contain the attributes in the GTF file.  This file contains gene-level coordinates and expression values.  Note that since the output for Cufflinks is for a single sample, the "q" numbering format (see the file format information) is not used.
  3. isoforms.fpkm_tracking
    This is a tab-delimited file containing one row per isoform; the columns contain the attributes in the GTF file.  This file contains transcript-level coordinates and expression values.  Note that since the output for Cufflinks is for a single sample, the "q" numbering format (see the file format information) is not used.
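
To make the layout of transcripts.gtf concrete, here is a small Python sketch that splits one GTF record into its fixed columns and its attribute column. It assumes the usual tab-delimited GTF layout in which the attributes are the final column and are written as key "value"; pairs. The function name parse_gtf_record and the sample record are illustrative and are not taken from real Cufflinks output.

```python
import re

# Matches one attribute of the form: key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_record(line):
    """Parse a single tab-delimited GTF line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    attributes = dict(ATTRIBUTE_PATTERN.findall(fields[-1]))
    return {
        "seqname": fields[0],
        "feature": fields[2],      # e.g. "transcript" or "exon"
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "attributes": attributes,  # includes gene_id and transcript_id
    }


if __name__ == "__main__":
    # Made-up record for illustration only.
    sample = 'chr1\tCufflinks\texon\t100\t900\t1000\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1";'
    record = parse_gtf_record(sample)
    print(record["feature"], record["attributes"]["transcript_id"])
```

Grouping records by their transcript_id attribute then recovers the exons that belong to each assembled transcript.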

Platform Dependencies

Task Type:
RNA-seq

CPU Type:
any

Operating System:
any

Language:
any

Version Comments

Version Release Date Description