GenePattern

The Hisat2.indexer generates genome indexes for the Hisat2Aligner module. HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome).

Author: Ted Liefeld

Contact:

Ted Liefeld, jliefeld@cloud.ucsd.edu

Algorithm Version: 2.1.0

Introduction

The Hisat2.indexer uses HISAT2's hisat2-build script to build a HISAT2 index from a set of DNA sequences. It outputs a set of 8 files with suffixes .1.ht2, .2.ht2, .3.ht2, .4.ht2, .5.ht2, .6.ht2, .7.ht2, and .8.ht2. For a large index, the files end in .ht2l instead of .ht2. Together these files constitute the index: they are all that is needed to align reads to that reference. HISAT2 no longer uses the original FASTA sequence files once the index is built.
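As a quick check on the file naming described above, the following minimal Python sketch lists the file names expected for a given index prefix (one per suffix, .1 through .8) and reports any that are missing from a directory. The function names `expected_index_files` and `missing_index_files` are hypothetical helpers for illustration and are not part of HISAT2 or GenePattern.

```python
from pathlib import Path


def expected_index_files(prefix, large=False):
    """Return the index file names for `prefix`, one per suffix .1 through .8.

    A large index uses the .ht2l extension instead of .ht2.
    """
    ext = "ht2l" if large else "ht2"
    return [f"{prefix}.{i}.{ext}" for i in range(1, 9)]


def missing_index_files(directory, prefix, large=False):
    """List the expected index files that are absent from `directory`."""
    directory = Path(directory)
    return [
        name
        for name in expected_index_files(prefix, large)
        if not (directory / name).is_file()
    ]
```

For example, `missing_index_files(".", "genome")` returns an empty list only when all of `genome.1.ht2` through `genome.8.ht2` are present in the current directory.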

Algorithm

The HISAT2 index is based on the FM Index of Ferragina and Manzini, which in turn is based on the Burrows-Wheeler transform. The algorithm used to build the index is based on the blockwise algorithm of Karkkainen, which allows hisat2-build to trade off between running time and memory usage. By default, hisat2-build automatically searches for the settings that yield the best running time without exhausting memory.
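To illustrate the ideas named above, here is a toy Python sketch of the Burrows-Wheeler transform and of FM-index backward search for counting pattern occurrences. It is not how hisat2-build works internally: it sorts all rotations naively rather than using a blockwise algorithm, it recounts characters on every query instead of storing sampled counts, and it is only practical for short strings. The function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via naively sorted rotations."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII order
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text, pattern):
    """Count occurrences of `pattern` using FM-index backward search."""
    # c_table[ch]: number of characters in the text strictly smaller than ch
    counts = {}
    for ch in bwt_text:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_text[:i]; a real index precomputes this.
        return bwt_text[:i].count(ch)

    lo, hi = 0, len(bwt_text)  # current range of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ(ch, lo)
        hi = c_table[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

The search processes the pattern from its last character to its first, narrowing a range of sorted rotations at each step, so the work per query depends on the pattern length rather than on scanning the reference.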

References

Sirén et al. 2014

https://ccb.jhu.edu/software/hisat2/index.shtml

Parameters

Name	Description
index name prefix*	The name prefix of the resulting index files and of the zip file which contains them.
fasta file	One or more FASTA files (or a zip file containing one or more FASTA files) containing the reference sequences to be aligned to. E.g., `<reference_in>` might be `chr1.fa,chr2.fa,chrX.fa,chrY.fa`.
gtf file	Optional GTF file with information about exons. If present, the module runs extract_exons.py and extract_splice_sites.py on the GTF file and then adds the splice sites and exons to the index.
dry run*	When true, the module only prints the HISAT2 command line that would be run to the standard output file (stdout.txt) and does not build the index. Useful for testing or for generating a command line to run HISAT2 outside of GenePattern.

* - required
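The parameters above map roughly onto a hisat2-build invocation of the form `hisat2-build [options] <reference_in> <ht2_base>`. The sketch below assembles such a command as an argument list and formats it as a string, similar in spirit to what a dry run prints. The module's actual wrapper code is not shown in this document, so the function names are hypothetical, and the splice-site and exon file paths are assumed to have been produced from the GTF file beforehand. The `--ss` and `--exon` options are hisat2-build options for supplying those files.

```python
import shlex


def build_index_command(fasta_files, index_prefix, splice_sites=None, exons=None):
    """Assemble a hisat2-build argument list (hypothetical helper)."""
    cmd = ["hisat2-build"]
    if splice_sites:
        cmd += ["--ss", splice_sites]  # splice sites extracted from the GTF
    if exons:
        cmd += ["--exon", exons]  # exons extracted from the GTF
    # Reference FASTA files are passed as one comma-separated argument,
    # followed by the index name prefix.
    cmd += [",".join(fasta_files), index_prefix]
    return cmd


def format_dry_run(cmd):
    """Return the command as a shell-quoted string without running it."""
    return shlex.join(cmd)
```

Calling `format_dry_run(build_index_command(["chr1.fa", "chr2.fa"], "genome"))` yields a command string that can be inspected or copied into a shell; passing the list to `subprocess.run` would execute it where hisat2-build is installed.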

Input Files

fasta file
One or more FASTA files (or a zip file containing one or more FASTA files) containing the reference sequences to be aligned to. E.g., <reference_in> might be chr1.fa,chr2.fa,chrX.fa,chrY.fa.
Example FASTA input files can be found at reads_1.fa and reads_2.fa.
gtf file
A GTF file containing gene annotations, from which exons and splice sites are extracted.
An example input GTF file can be found at Homo_sapiens_hg19_UCSC.gtf.

Output Files

genome.zip
A zip file containing the 8 index files created by the indexer, suitable for use with the Hisat2Aligner module.
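One way to sanity-check the output archive before passing it to an aligner is to list its index members. The following sketch uses Python's standard `zipfile` module; `index_members` is a hypothetical helper, and the demonstration builds an empty in-memory archive with assumed member names rather than reading a real genome.zip.

```python
import io
import zipfile


def index_members(zip_path_or_file):
    """Return the sorted names of .ht2 / .ht2l members in an index zip."""
    with zipfile.ZipFile(zip_path_or_file) as zf:
        return sorted(
            name for name in zf.namelist() if name.endswith((".ht2", ".ht2l"))
        )


# Demonstration with an in-memory archive holding empty placeholder files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for i in range(1, 9):
        zf.writestr(f"genome.{i}.ht2", b"")
    zf.writestr("README.txt", b"not an index file")
buf.seek(0)
demo_members = index_members(buf)
```

With a real archive, `index_members("genome.zip")` should return one name per index suffix.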

Requirements

This module is implemented using a Docker container to provide the environment.

Platform Dependencies

Task Type:

CPU Type:

Operating System:

Language:

Version Comments

Version	Release Date	Description
2	2021-10-11	Rename HISAT2Indexer to HISAT2.indexer
1	2018-10-25	Initial production release

Hisat2Indexer (v1)