Hisat2Aligner (v1)

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Please refer to https://ccb.jhu.edu/software/hisat2/index.shtml for details of the algorithm.

Author: Ted Liefeld

Contact:

Ted Liefeld, jliefeld@cloud.ucsd.edu

Algorithm Version: 2.1.0

Introduction

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome).

Algorithm

Based on an extension of BWT for graphs [Sirén et al. 2014], we designed and implemented a graph FM index (GFM), an original approach and its first implementation to the best of our knowledge. In addition to using one global GFM index that represents a population of human genomes, HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome (each index representing a genomic region of 56 Kbp, with 55,000 indexes needed to cover the human population). These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. This new indexing scheme is called a Hierarchical Graph FM index (HGFM). 
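As a rough sanity check on the figures above, the total span covered by the local indexes can be worked out directly. This is only a sketch: it assumes 1 Kbp = 1,000 bp and treats the local indexes as non-overlapping windows, neither of which is stated in the text.

```python
# Figures quoted in the paragraph above.
LOCAL_INDEX_SPAN_BP = 56_000   # 56 Kbp per local index (assuming 1 Kbp = 1,000 bp)
NUM_LOCAL_INDEXES = 55_000     # number of local indexes

# Assumption: windows do not overlap, so the covered span is a simple product.
covered_bp = LOCAL_INDEX_SPAN_BP * NUM_LOCAL_INDEXES

print(f"{covered_bp:,} bp = {covered_bp / 1e9:.2f} Gbp")
```

The product is about 3.08 Gbp, which is on the order of the size of a human genome.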

References

Sirén et al. 2014

https://ccb.jhu.edu/software/hisat2/index.shtml

 

Parameters

Name Description
Input
index* Directory or zip file containing a Hisat2 genome index to be aligned to.
reads pair 1* Unpaired reads file or first mate for paired reads. One or more files containing reads in FASTA or FASTQ format (bz2 and gz compressed files are supported).
reads pair 2* Second mate for paired reads. Zero or more files in FASTA or FASTQ format (bz2 and gz compressed files are supported).
input format* The format of the input reads files.  May be fastQ, fastA, raw (one sequence per line) or Illumina qseq format.
quality value scale* Whether to use the Solexa, Phred 33, or Phred 64 quality value scale.
integer quality value Quality values are represented in the read input file as space-separated ASCII integers, e.g., 40 40 30 40..., rather than ASCII characters, e.g., II?I.... Integers are treated as being on the Phred quality scale unless "Solexa" is also specified for the <quality value scale>.
mate orientations* The upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand. 
novel splice sites infile  An optional list of known splice sites, which HISAT2 makes use of to align reads with small anchors. (This is the output of novel splice sites - See Supplementary Output)
 
Output
output prefix* The prefix to use for the output file name.
dry run* When true, the module only prints the hisat command-line that would be used to the program's standard output file (stdout.txt) but does not execute the alignment.  Useful for testing or generating a command line to run HISAT2 outside of GenePattern.
Advanced Customization of Run
max reads to align* Align the first # reads or read pairs from the input (after the `-s`/`--skip` reads or pairs have been skipped), then stop. Mainly useful for testing.
ignore read qualities* When calculating a mismatch penalty, always consider the quality value at the mismatched position to be the highest possible, regardless of the actual value. That is, the input is treated as though all quality values are high. This is also the default behavior when the input doesn't specify quality values.
align* Align unpaired reads against the forward reference strand only, the reverse-complement (Crick) reference strand only, or both.
min mismatch penalty* Sets the minimum (`MN`) mismatch penalty. A number less than or equal to <max mismatch penalty> (`MX`) and greater than or equal to `MN` is subtracted from the alignment score for each position where a read character aligns to a reference character, the characters do not match, and neither is an `N`. If <ignore read qualities> is specified, the number subtracted equals `MX`. Otherwise, the number subtracted is `MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) )` where Q is the Phred quality value. Default: `MX` = 6, `MN` = 2.
max mismatch penalty* Sets the maximum (`MX`) mismatch penalty. A number less than or equal to `MX` and greater than or equal to <min mismatch penalty> (`MN`) is subtracted from the alignment score for each position where a read character aligns to a reference character, the characters do not match, and neither is an `N`. If <ignore read qualities> is specified, the number subtracted equals `MX`. Otherwise, the number subtracted is `MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) )` where Q is the Phred quality value. Default: `MX` = 6, `MN` = 2.
soft clipping* Allow or disallow soft clipping.
min softclip penalty* Sets the minimum (MN) penalty for soft-clipping per base. A number less than or equal to the max softclip penalty (MX) and greater than or equal to MN is subtracted from the alignment score for each position. The number subtracted is MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) ) where Q is the Phred quality value. Default: MX = 2, MN = 1.
max softclip penalty* Sets the maximum (MX) penalty for soft-clipping per base. A number less than or equal to MX and greater than or equal to the min softclip penalty (MN) is subtracted from the alignment score for each position. The number subtracted is MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) ) where Q is the Phred quality value. Default: MX = 2, MN = 1.
min n ceil* Sets the minimum value in a linear function governing the maximum number of ambiguous characters (usually `N`s and/or `.`s) allowed in a read as a function of read length. 
max n ceil* Sets a maximum in a linear function governing the maximum number of ambiguous characters (usually `N`s and/or `.`s) allowed in a read as a function of read length. 
ambiguous read penalty* Sets penalty for positions where the read, reference, or both, contain an ambiguous character such as `N`. 
read gap open penalty* Sets the read gap open penalty. A read gap of length N gets a penalty of <read gap open penalty> + N * <read gap extend penalty>.
read gap extend penalty* Sets the read gap extend penalty. A read gap of length N gets a penalty of <read gap open penalty> + N * <read gap extend penalty>.
reference gap open penalty Sets the reference gap open penalty. A reference gap of length N gets a penalty of <reference gap open penalty> + N * <reference gap extend penalty>.
reference gap extend penalty* Sets the reference gap extend penalty. A reference gap of length N gets a penalty of <reference gap open penalty> + N * <reference gap extend penalty>.
spliced alignments* Disables spliced alignments if set to 'no'.
penalty for canonical splice sites* Sets the penalty for each pair of canonical splice sites (e.g. GT/AG). 
penalty for non-canonical splice sites Sets the penalty for each pair of non-canonical splice sites (e.g. non-GT/AG). 
min score align* Sets the constant term of the linear function governing the minimum alignment score needed for an alignment to be considered "valid" (i.e. good enough to report). The function depends on read length. For instance, specifying 0 sets the minimum-score function f to f(x) = 0 + -0.6 * x, where x is the read length and -0.6 is the <max score align> value.
max score align* Sets the coefficient of read length in the linear function governing the minimum alignment score needed for an alignment to be considered "valid" (i.e. good enough to report). The function depends on read length. For instance, specifying -0.6 sets the minimum-score function f to f(x) = 0.1 + -0.6 * x, where x is the read length and 0.1 is the <min score align> value.
minimum fragment length for paired alignment*

The minimum fragment length for valid paired-end alignments. This option is valid only with no spliced alignment. E.g. if 60 is specified and a paired-end alignment consists of two 20-bp alignments in the appropriate orientation with a 20-bp gap between them, that alignment is considered valid (as long as the maximum fragment length is also satisfied). A 19-bp gap would not be valid in that case. If trimming options -3 or -5 are also used, the constraint is applied with respect to the untrimmed mates.

The larger the difference between minimum and maximum fragment lengths, the slower HISAT2 will run. This is because larger differences scan a larger window to determine if a concordant alignment exists. For typical fragment length ranges (200 to 400 nucleotides), HISAT2 is very efficient. 

maximum fragment length for paired alignment*

The maximum fragment length for valid paired-end alignments. This option is valid only with no spliced alignment. E.g. if 100 is specified and a paired-end alignment consists of two 20-bp alignments in the appropriate orientation with a 60-bp gap between them, that alignment is considered valid (as long as the minimum fragment length is also satisfied). A 61-bp gap would not be valid in that case. If trimming options -3 or -5 are also used, the constraint is applied with respect to the untrimmed mates.

The larger the difference between minimum and maximum fragment lengths, the slower HISAT2 will run. This is because larger differences scan a larger window to determine if a concordant alignment exists. For typical fragment length ranges (200 to 400 nucleotides), HISAT2 is very efficient. 

unpaired alignments for paired reads* By default, when `hisat2` cannot find a concordant or discordant alignment for a pair, it then tries to find alignments for the individual mates. This option disables that behavior. 
discordant alignments for paired reads* By default, `hisat2` looks for discordant alignments if it cannot find any concordant alignments. A discordant alignment is an alignment where both mates align uniquely, but that does not satisfy the paired-end constraints (`--fr`/`--rf`/`--ff`, `-I`, `-X`). This option disables that behavior.
max seeds extended* HISAT2, like other aligners, uses a seed-and-extend approach: it tries to extend seeds to full-length alignments. In HISAT2, --max-seeds controls the maximum number of seeds that will be extended. HISAT2 extends up to this many seeds and skips the rest. Large values for <max seeds extended> may improve alignment sensitivity, but HISAT2 is not designed with large values for <max seeds extended> in mind, and when aligning reads to long, repetitive genomes a large <max seeds extended> can be very slow.
max primary alignments* HISAT2 searches for up to N distinct, primary alignments for each read, where N equals the integer specified with this parameter. Primary alignments are alignments whose alignment score is equal to or higher than that of any other alignment. It is possible that multiple distinct alignments have the same score. For example, if `2` is specified, HISAT2 will search for at most 2 distinct alignments. The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. Each reported read or pair alignment beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAGS field.
secondary alignments Report secondary alignments.
Parameters with default values for long mammalian introns
min penalty long introns with canonical splice sites* Sets the minimum in a natural log penalty function for long introns with canonical splice sites so that alignments with shorter introns are preferred to those with longer ones.
max penalty long introns with canonical splice sites* Sets the maximum in a natural log penalty function for long introns with canonical splice sites so that alignments with shorter introns are preferred to those with longer ones.
min penalty long introns with noncanonical splice sites* Sets the minimum in a natural log penalty function for long introns with noncanonical splice sites so that alignments with shorter introns are preferred to those with longer ones. 
max penalty long introns with noncanonical splice sites* Sets the maximum in a natural log penalty function for long introns with noncanonical splice sites so that alignments with shorter introns are preferred to those with longer ones. 
minimum intron length* Sets minimum intron length. 
maximum intron length Sets maximum intron length. 
Supplementary Output
novel splice sites Optional: output file for novel splice sites found.
mapped reads Optional: write unpaired reads that align at least once to a file.
unmapped reads Write paired-end reads that align concordantly at least once to file(s).

* - required
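
The scoring rules described in the parameter table can be sketched in a few lines of Python. This is an illustrative reading of the formulas in the descriptions, not HISAT2's actual implementation. The function names are hypothetical, the gap penalty is read as "open penalty + N * extend penalty", and the gap values used in the example call (5 and 3) are placeholders rather than documented module defaults.

```python
import math


def mismatch_penalty(q, mn=2, mx=6, ignore_quals=False):
    """Penalty for one mismatched position with Phred quality q.

    Implements MN + floor((MX - MN) * (MIN(Q, 40.0) / 40.0)).
    """
    if ignore_quals:
        # All quality values are treated as the highest possible.
        return mx
    return mn + math.floor((mx - mn) * (min(q, 40.0) / 40.0))


def gap_penalty(length, open_penalty, extend_penalty):
    """Penalty for a read or reference gap of the given length."""
    return open_penalty + length * extend_penalty


def min_valid_score(read_length, constant, coefficient):
    """Minimum score f(x) = constant + coefficient * x for a valid alignment."""
    return constant + coefficient * read_length


def fragment_length_ok(mate1_len, gap_len, mate2_len, min_frag, max_frag):
    """True if the fragment spanned by both mates lies in [min_frag, max_frag]."""
    fragment = mate1_len + gap_len + mate2_len
    return min_frag <= fragment <= max_frag


if __name__ == "__main__":
    for q in (0, 20, 40):
        print("mismatch penalty at Q =", q, "->", mismatch_penalty(q))
    # Placeholder gap values (5 and 3), chosen only for illustration.
    print("gap of length 2 ->", gap_penalty(2, 5, 3))
    print("min valid score, 100-bp read ->", min_valid_score(100, 0, -0.6))
    print("20 + 20 + 20 bp fragment, limits 60..100 ->",
          fragment_length_ok(20, 20, 20, 60, 100))
```

The soft-clipping penalty uses the same quality-scaled form, so under the same assumptions it can be obtained by calling `mismatch_penalty(q, mn=1, mx=2)`.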

Input Files

  1. RNA-seq reads files in FASTA/FASTQ format (can be gzip or bzip2 compressed). For more information on the FASTA format, see the NIH description here: 
    http://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml. For more information on the FASTQ format, see the specification here: http://nar.oxfordjournals.org/content/early/2009/12/16/nar.gkp1137.full. Example FASTA input files can be found at reads_1.fa and reads_2.fa.
  2. Custom Hisat2 index (optional, if the prebuilt indexes do not include the genome you need) 
    This file is a genome reference index. You must create this file using Hisat2 (Hisat2 2.0 or higher) and can use the Hisat2Indexer GenePattern module for this. A large and growing number of hosted genomes are selectable from the index parameter, which may let you skip this step. An example Hisat2 index for the sample FASTA files (above) can be found at 22_20-21M_snp.zip

Output Files

  1. <output prefix>.sam 
    A list of read alignments in SAM format. This file can be used as input for Cufflinks. BAM is the binary equivalent of SAM, a compact short read alignment format. For more information on the SAM/BAM formats, see the specification at: 
    http://samtools.sourceforge.net.
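
As a quick check of the SAM output, the FLAG field (second column) can be inspected to separate primary, secondary, and unmapped records. The sketch below is a minimal example, not part of the module. The function name `count_alignments` and the sample records are made up for illustration, and supplementary records (bit 0x800) are not distinguished.

```python
def count_alignments(sam_lines):
    """Count primary, secondary, and unmapped records in SAM text lines."""
    counts = {"primary": 0, "secondary": 0, "unmapped": 0}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the second tab-separated field
        if flag & 0x4:       # 0x4: read is unmapped
            counts["unmapped"] += 1
        elif flag & 0x100:   # 0x100 (256): secondary alignment
            counts["secondary"] += 1
        else:
            counts["primary"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical records, shortened for illustration.
    example = [
        "@HD\tVN:1.0\tSO:unsorted",
        "read1\t0\tchr22\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "read1\t256\tchr22\t900\t1\t4M\t*\t0\t0\tACGT\tIIII",
        "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    print(count_alignments(example))
```

To run it on real output, pass an open file handle for the `<output prefix>.sam` file in place of the example list.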

Requirements

This module is implemented using a Docker container to provide the environment.  

Platform Dependencies

Task Type:
Sequence Analysis

CPU Type:
docker

Operating System:
ubuntu

Language:

Version Comments

Version Release Date Description
1 2018-10-25 Initial production release