Provides a variety of options for trimming Illumina FASTQ files of adapter sequences and low-quality reads.
Author: Anthony Bolger et al., Usadel Lab, Rheinisch-Westfälische Technische Hochschule Aachen
Algorithm Version: 0.32
The GenePattern Trimmomatic module conducts quality-based trimming and filtering of FASTQ-formatted short read data produced by Illumina sequencers. The module can also be used to remove adapters and other Illumina technical sequences from the read sequences. The module operates on either paired-end or single-end data. With paired-end data, the tool maintains correspondence of read pairs and also uses the additional information contained in paired reads to better find adapter sequences contaminating the read data. The module wraps the Trimmomatic command line tool [Bolger, et al., 2014]. Using the command line tool, a user specifies which trimming/filtering operations to employ and how the selected operations are to be ordered. The GenePattern Trimmomatic module directly exposes through its GUI six of the most frequently used trimming/filtering operations and enforces a particular relative ordering, supporting the most common usage scenarios. Through the extra.steps parameter, GenePattern users may directly specify Trimmomatic command line options and thus gain access to the underlying tool's full range of functionality.
The goal of FASTQ trimming and filtering is to remove low-quality base calls from reads, and to remove detrimental artifacts introduced into the reads by the sequencing process. The removal of low quality reads and contaminating sequences will improve processing by downstream tools such as aligners. The tool provides operations to detect and remove known adapter fragments (adapter.clip), remove low-quality regions from the start and end of the reads (trim.leading and trim.trailing), drop short reads (min.read.length), as well as operations with different quality-filtering strategies for removing low-quality bases within the reads (max.info and sliding.window).
Trimmomatic works with Illumina FASTQ files using phred33 or phred64 quality scores. The appropriate setting depends on the Illumina pipeline used. The default is phred33, which matches modern Illumina pipelines. Correct specification of the phred encoding is critical to successful trimming. The tool will incorrectly interpret the quality values in the FASTQ if the wrong encoding is specified.
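To illustrate why the encoding matters, here is a minimal Python sketch (not part of the module; the function name `decode_qualities` is hypothetical) that decodes the same quality string under both offsets. It assumes the usual convention that each quality character stores the Phred score plus an ASCII offset of 33 or 64.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


# A quality string that was written with the phred33 encoding.
quals = "II5+"

print(decode_qualities(quals, offset=33))  # [40, 40, 20, 10]
print(decode_qualities(quals, offset=64))  # [9, 9, -11, -21]
```

Read with the wrong offset, high-quality bases look poor (or even negative), so any threshold-based trimming step would discard good data.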
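The following operations are available directly from the module parameters. All of them are optional; those that are specified are executed in this order: (1) adapter.clip, (2) trim.leading, (3) trim.trailing, (4) max.info, (5) sliding.window, (6) min.read.length.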
In order to simplify the workflow in GenePattern, these operations always execute in the above order when specified through the module parameters. This order allows for the most common example workflows and also matches the general recommendations of the Trimmomatic documentation. The underlying trimming engine is much more flexible; if you have the need for that increased flexibility, it can be accessed through the extra.steps parameter.
A typical usage scenario involves operations 1, 2, 3, and 6 along with either the max.info or the sliding.window operation (using both of these together is not recommended). The adapter.clip step is done first because the known adapter sequences are more likely to be recognized within the original read than in one that has been modified by another trimming step. The trim.leading and trim.trailing steps happen next and are often used with a very low phred threshold to quickly remove the special Illumina 'low-quality regions' at the start and end of the reads as a precursor to the subsequent, more sophisticated max.info and sliding.window quality-filtering operations. Finally, the min.read.length step is used to drop any read shorter than a desired length.
FastQC, used for quality assessment of the raw reads, includes an analysis of overrepresented sequences. When conducting this analysis, FastQC also checks to see whether any overrepresented sequences correspond to known Illumina adapter and primer sequences. If the resulting Overrepresented Sequences report flags matches with known adapter or primer sequences, these can be removed by using the adapter.clip step.
Of the two quality-filtering operations, max.info is newer and more sophisticated, and is recommended over the older sliding.window strategy by the Trimmomatic authors. One important feature of max.info is that it can be tuned to be more strict or tolerant based on the expected downstream use, where 'strict' applications favor stronger alignment accuracy (e.g., are more sensitive to base mismatches) and 'tolerant' applications favor longer reads (where downstream tools or analysis can tolerate or correct for larger numbers of mismatches or indels). Reference-based RNA-Seq would tend to be in the former category while assembly or variant finding would be in the latter. sliding.window, however, remains an effective method for quality-based trimming of RNA-Seq short reads; its input parameters are more easily interpreted than max.info's and there are established guidelines for their settings.
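The sliding-window idea can be sketched in a few lines of Python. This is a simplified illustration only, not Trimmomatic's actual implementation: the function name is hypothetical, the cut is placed at the start of the first failing window, and reads shorter than the window are kept whole, all of which are assumptions made for brevity.

```python
def sliding_window_keep_length(qualities, window_size, threshold):
    """Return how many leading bases to keep.

    Scans from the start of the read and cuts at the first window whose
    mean quality falls below the threshold (simplified sketch).
    """
    if len(qualities) < window_size:
        return len(qualities)  # assumption: short reads are left untouched
    for start in range(len(qualities) - window_size + 1):
        window = qualities[start:start + window_size]
        if sum(window) / window_size < threshold:
            return start
    return len(qualities)


# One isolated poor base does not trigger a cut ...
print(sliding_window_keep_length([30, 30, 30, 30, 2, 30, 30, 30, 30, 30], 4, 15))  # 10
# ... but a sustained drop in quality does.
print(sliding_window_keep_length([30, 30, 30, 30, 10, 8, 5, 2, 2, 2], 4, 15))  # 3
```

The window size and threshold used here (4 and 15) are illustrative values, not recommendations taken from this document.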
The module can also be used to convert into a specific phred encoding through the convert.phred.scores parameter. At least one processing step must be chosen, either from operations 1-6, convert.phred.scores or extra.steps.
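Conceptually, converting between encodings only shifts each quality character by the difference between the two offsets. The sketch below shows that idea in Python; the function name `convert_phred` is hypothetical and this is not how the module itself is invoked.

```python
def convert_phred(quality_string, from_offset, to_offset):
    """Re-encode a FASTQ quality string from one ASCII offset to another."""
    return "".join(chr(ord(ch) - from_offset + to_offset) for ch in quality_string)


# Scores 40, 40, 20, 10 written as phred64, re-encoded as phred33.
print(convert_phred("hhTJ", from_offset=64, to_offset=33))  # II5+
```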
For single-ended data, a single input file is specified and the module will create a single output file of trimmed/filtered reads. For paired-end data, two input files, one for each mate of the paired-end reads, are specified and the module will create four output files, two for the ‘paired’ output where both reads survived the processing, and two for the corresponding ‘unpaired’ output containing reads where only one of two paired reads survived trimming/filtering.
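The routing of paired-end reads into four outputs can be pictured with the following Python sketch. The dictionary keys are hypothetical labels chosen for this example and are not the module's actual output file names; a read that did not survive trimming/filtering is represented as `None`.

```python
def route_pairs(pairs):
    """Distribute (forward, reverse) read pairs over four output lists."""
    out = {
        "forward_paired": [],
        "reverse_paired": [],
        "forward_unpaired": [],
        "reverse_unpaired": [],
    }
    for forward, reverse in pairs:
        if forward is not None and reverse is not None:
            # Both mates survived: keep them in step in the paired outputs.
            out["forward_paired"].append(forward)
            out["reverse_paired"].append(reverse)
        elif forward is not None:
            out["forward_unpaired"].append(forward)
        elif reverse is not None:
            out["reverse_unpaired"].append(reverse)
        # If neither mate survived, the pair is dropped entirely.
    return out


result = route_pairs([("ACGT", "TTGA"), ("GGCA", None), (None, "CCAT"), (None, None)])
print(result)
```

Because the two paired lists are always appended to together, the n-th forward read still corresponds to the n-th reverse read, which is what downstream paired-end tools generally expect.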
Details of each of the available steps are explained below in turn. For reference, the underlying operations are listed as well; these are further described in the Trimmomatic manual.
|Phred Quality score (standard Sanger variant)||base call error probability|
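The relationship between the two columns follows the standard Sanger definition, Q = -10 * log10(P), where P is the base call error probability. A small Python sketch (the function name is hypothetical) computes the probability for a few commonly quoted scores:

```python
def phred_to_error_probability(q):
    """Base call error probability for a Phred quality score (Sanger definition)."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:g} (about 1 in {round(1 / p)})")
```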
Finally, any trimming operations specified in the extra.steps parameter will be performed after those in the above predefined list. Such operations must be specified using the exact syntax found in the Trimmomatic manual; use spaces to separate multiple operations. This allows you to perform operations in a different order than the list above, or to access other operations not presented here. Even when using extra.steps, it is still recommended that adapter.clip (ILLUMINACLIP) be performed first and that MINLEN be performed last. Note that because ILLUMINACLIP requires a file parameter it is highly inconvenient to use through extra.steps due to the need to specify a server-side file path. For this reason, it is best to use the adapter.clip parameters rather than specifying ILLUMINACLIP through extra.steps.
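As a rough illustration of how the module parameters and extra.steps combine, the following Python sketch assembles a list of Trimmomatic step strings. It is not the module's actual code. The function and argument names are hypothetical, the step syntax (`ILLUMINACLIP`, `LEADING`, `TRAILING`, `MAXINFO`, `SLIDINGWINDOW`, `MINLEN`) follows the Trimmomatic manual, placing `MAXINFO` before `SLIDINGWINDOW` is an assumption, and the values in the example call are illustrative only.

```python
def build_steps(adapter_clip=None, trim_leading=None, trim_trailing=None,
                max_info=None, sliding_window=None, min_read_length=None,
                extra_steps=""):
    """Assemble step strings in a fixed order, then append extra steps."""
    steps = []
    if adapter_clip is not None:
        fasta, seed_mismatches, palindrome_threshold, simple_threshold = adapter_clip
        steps.append(
            f"ILLUMINACLIP:{fasta}:{seed_mismatches}:"
            f"{palindrome_threshold}:{simple_threshold}"
        )
    if trim_leading is not None:
        steps.append(f"LEADING:{trim_leading}")
    if trim_trailing is not None:
        steps.append(f"TRAILING:{trim_trailing}")
    if max_info is not None:
        target_length, strictness = max_info
        steps.append(f"MAXINFO:{target_length}:{strictness}")
    if sliding_window is not None:
        size, quality = sliding_window
        steps.append(f"SLIDINGWINDOW:{size}:{quality}")
    if min_read_length is not None:
        steps.append(f"MINLEN:{min_read_length}")
    # extra.steps holds space-separated operations that run after all others.
    steps.extend(extra_steps.split())
    return steps


print(build_steps(adapter_clip=("TruSeq3-PE.fa", 2, 30, 10),
                  trim_leading=3, trim_trailing=3,
                  sliding_window=(4, 15),
                  extra_steps="MINLEN:36"))
```

In this call MINLEN is passed through `extra_steps` purely to show the appending behaviour; with the module itself, min.read.length would normally be the simpler choice.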
Trimmomatic supports three other trimming operations not presented in the predefined list above: CROP, HEADCROP, and AVGQUAL.
CROP and HEADCROP were left out of the above predefined list as their use is somewhat at odds with the other quality-based and adaptive approaches. Certain trimming strategies simply want to cut a certain number of bases from the start and/or end of every read and nothing more. Use these operations through extra.steps if that is your goal.
AVGQUAL was left out because its use is not well documented in the Trimmomatic manual, making it unclear where to place it in the overall order and leaving its use harder to explain. Our understanding is that it is similar to the sliding.window approach but always applied at the level of the entire read. As such, it is probably best to use only one of AVGQUAL or sliding.window.
The parameter setting recommendations are largely based on the Trimmomatic manual and the example included near the end. For paired-end data, the corresponding settings for this example would be:
For single-ended data you would provide only input.file.1 and leave input.file.2 blank, and use TruSeq3-SE.fa as the adapter.clip.sequence.file (again, adjusted according to your platform). The adapter.clip.palindrome.clip value of 30 should still be specified, though it will be ignored for this usage.
Bolger et al. (2014) provide several examples of the use of a Maximum Information approach. They used the following settings for a 'strict' alignment application:
Finally, here is an example using extra.steps to perform a simple trimming of reads past the 45th base, followed by removal of the first 5 bases, and then dropping any reads with length under 36:
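Going by the STEP:value syntax of the Trimmomatic manual, the extra.steps value for this scenario would presumably be `CROP:45 HEADCROP:5 MINLEN:36`. The Python sketch below mimics the effect of those three operations on a single read. It is an illustration under that assumption, with a hypothetical function name, and it handles only the base sequence (a real trimmer cuts the quality string in parallel).

```python
def crop_headcrop_minlen(sequence, crop=45, headcrop=5, minlen=36):
    """Mimic CROP, then HEADCROP, then MINLEN on one read sequence."""
    sequence = sequence[:crop]      # CROP: discard everything past base 45
    sequence = sequence[headcrop:]  # HEADCROP: discard the first 5 bases
    if len(sequence) < minlen:      # MINLEN: drop reads that are now too short
        return None
    return sequence


long_read = "A" * 50
short_read = "A" * 40

print(len(crop_headcrop_minlen(long_read)))  # 40
print(crop_headcrop_minlen(short_read))      # None
```

A 50-base read is cut to 45 and then to 40 bases and survives; a 40-base read ends up with 35 bases and is dropped.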
Note that we are not recommending the last example as an ideal trimming approach. It is simply illustrative of the use of extra.steps and some of the additional Trimmomatic operations.
Trimmomatic manual. This documentation is largely adapted from that manual.
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
Lohse M, Bolger AM, Nagel A, Fernie AR, Lunn JE, Stitt M, Usadel B. RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res. 2012 Jul;40(Web Server issue):W622-7.
|input file 1 *||The input FASTQ to be trimmed. For paired-end data, this should be the forward ("*_1" or "left") input file.|
|input file 2||The reverse ("*_2" or "right") input FASTQ of paired-end data to be trimmed.|
|output filename base *||A base name to be used for the output files.|
|adapter clip sequence file||A FASTA file containing the adapter sequences, PCR sequences, etc. to be clipped. This parameter is required to enable adapter clipping. Files are provided for several Illumina pipelines, but you can also provide your own; see the manual for details on creating an adapter sequence file. Be sure to choose a PE file for paired-end data and an SE file for single-end data.|
|adapter clip seed mismatches||Specifies the maximum mismatch count which will still allow a full match to be performed. A value of 2 is recommended. This parameter is required to enable adapter clipping.|
|adapter clip palindrome clip threshold||Specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment. This is the log10 probability against getting a match by random chance; values around 30 or more are recommended. This parameter is required to enable adapter clipping.|
|adapter clip simple clip threshold||Specifies how accurate the match between any adapter etc. sequence must be against a read, as a log10 probability against getting a match by random chance; values between 7 and 15 are recommended. This parameter is required to enable adapter clipping.|
|adapter clip min length||In addition to the alignment score, palindrome mode can verify that a minimum length of adapter has been detected. If unspecified, this defaults to 8 bases, for historical reasons. However, since palindrome mode has a very low false positive rate, this can be safely reduced, even down to 1, to allow shorter adapter fragments to be removed.|
|adapter clip keep both reads *||Controls whether to keep both forward and reverse reads when trimming in palindrome mode. Once the adapter has been removed, the reverse read contains the same sequence as the forward read, but in reverse complement, and so carries no additional information. The default is "yes" (retain the reverse read), which is useful when downstream tools cannot handle a combination of paired and unpaired reads.|
|trim leading quality threshold||Remove low quality bases from the beginning. As long as a base has a value below this threshold the base is removed and the next base will be investigated. See the Usage section above for recommendations.|
|trim trailing quality threshold||Remove low quality bases from the end. As long as a base has a value below this threshold the base is removed and the next trailing base will be investigated. See the Usage section above for recommendations.|
|max info target length||This parameter specifies the read length which is likely to allow the location of the read within the target sequence to be determined. A typical value for target length is 40.|
|max info strictness||This value, which should be set between 0 and 1, specifies the balance between preserving as much read length as possible vs. removal of incorrect bases. A low value of this parameter (<0.2) favors longer reads, while a high value (>0.8) favors read correctness. Both max.info.target.length and max.info.strictness are required for the Max Info quality trim. Examples presented in [Bolger, 2014] employ a value of 0.4 for "tolerant" applications and values from 0.9 all the way up to 0.999 for "strict" applications.|
|sliding window size||Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold. By considering multiple bases, a single poor quality base will not cause the removal of high quality data later in the read. This parameter specifies the number of bases to average across. See the Usage section above for recommendations.|
|sliding window quality threshold||Specifies the average quality required for the sliding window trimming. Both sliding.window.size and sliding.window.quality.threshold are required to enable the sliding window trimming. See the Usage section above for recommendations.|
|min read length||Remove reads that fall below the specified minimal length.|
|extra steps||Extra steps to be performed after any other processing. These must be specified in exactly the format described in the Trimmomatic manual; see the documentation for details. This is recommended for advanced users only.|
|phred encoding *||Allows you to specify the phred quality encoding. The default is phred33, which matches modern Illumina pipelines.|
|convert phred scores||Convert phred scores into a particular encoding. Leave this blank for no conversion.|
|create trimlog *||Create a log of the trimming process. This gives details on what operations were performed, etc. but can be quite lengthy.|
* - required
Use the output.filename.base parameter to specify a base name for the output files that will be created. By default, this will be the name of input.file.1 with both the FASTQ (.fq or .fastq) and compression (.gz or .bz2) extensions removed. Also, if this name (minus extensions) ends in "_1", then this suffix will also be removed to avoid producing output files with confusing names. For example, if input.file.1 is "my_reads_1.fastq.bz2" (presumably paired with "my_reads_2.fastq.bz2") then the module will use "my_reads" as the output.filename.base when creating output files. The names in the list below reflect this naming scheme.
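The default naming rule described above can be sketched in Python as follows. This is an approximation for illustration, not the module's own code: the function name is hypothetical, and matching the extensions case-sensitively is an assumption.

```python
import re


def default_output_base(input_file_1):
    """Derive the default output base name from the first input file name."""
    name = re.sub(r"\.(gz|bz2)$", "", input_file_1)  # strip compression extension
    name = re.sub(r"\.(fq|fastq)$", "", name)         # strip FASTQ extension
    if name.endswith("_1"):                           # strip the forward-read suffix
        name = name[:-2]
    return name


print(default_output_base("my_reads_1.fastq.bz2"))  # my_reads
print(default_output_base("sample.fq"))             # sample
```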
Output FASTQ files will normally use the .fq extension, though if the original input.file.1 used the .fastq extension then this will be used instead. The names in the list below will use .fq with no compression extension, for the sake of uniformity.
FASTQ files compressed using either gzip or bzip2 are supported and are automatically identified by use of the .gz or .bz2 file extensions. Note: we have seen severe issues with Trimmomatic hanging indefinitely when asked to bz2-compress output and so this feature has been disabled; there are no issues with .bz2 input. If the input file is compressed (as either .gz or .bz2) then the output will be as well (though always using gzip).
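Taken together, the extension and compression rules above amount to a small decision procedure, sketched here in Python. The function name is hypothetical and the sketch covers only the file extension, not the full output file names.

```python
def output_extension(input_file_1):
    """Pick the output extension implied by the first input file name."""
    compressed = input_file_1.endswith((".gz", ".bz2"))
    # Look at the name without its compression extension, if any.
    stem = input_file_1.rsplit(".", 1)[0] if compressed else input_file_1
    extension = ".fastq" if stem.endswith(".fastq") else ".fq"
    # Compressed input yields compressed output, but always gzip (never bz2).
    return extension + (".gz" if compressed else "")


print(output_extension("my_reads_1.fastq.bz2"))  # .fastq.gz
print(output_extension("my_reads_1.fq"))         # .fq
```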