Beta tutorial Please report any issues in the comments section.
PathSeq is a GATK pipeline for detecting microbial organisms in short-read deep sequencing samples taken from a host organism (e.g. human). The diagram below summarizes how it works. In brief, the pipeline performs read quality filtering, subtracts reads derived from the host, aligns the remaining (non-host) reads to a reference of microbe genomes, and generates a table of detected microbial organisms. The results can be used to determine the presence and abundance microbial organisms as well as to discover novel microbial sequences.
PathSeq pipeline diagram Boxes outlined with dashed lines represent files. The green boxes at the top depict the three phases of the pipeline: read quality filtering / host subtraction, microbe alignment, and taxonomic abundance scoring. The blue boxes show tools used for pre-processing the host and microbe references for use with PathSeq.
This tutorial describes:
A more detailed introduction of the pipeline can be found in the PathSeqPipelineSpark tool documentation. For more information about the other tools, see the Metagenomics section of the GATK documentation.
Host and microbe references must be prepared for PathSeq as described in this tutorial. The tutorial files provided below contain references that are designed specifically for this tutorial and should not be used in practice. Users can download recommended pre-built reference files for use with PathSeq from the GATK Resource Bundle FTP server in /bundle/pathseq/ (see readme file). This tutorial also covers how to build custom host and microbe references.
The PathSeq tools are bundled with the GATK 4 release. For the most up-to-date GATK installation instructions, please see https://github.com/broadinstitute/gatk. This tutorial assumes you are using a POSIX (e.g. Linux or MacOS) operating system with at least 2Gb of memory.
tutorial_10913.tar.gz from the ftp site. Extract the archive with the command:
> tar xzvf pathseq_tutorial.tar.gz > cd pathseq_tutorial
You should now have the following files in your current directory:
The pipeline accepts reads in BAM format (if you have FASTQ files, please see this article on how to convert to BAM). In this example, the pipeline can be run using the following command:
> gatk PathSeqPipelineSpark \ --input test_sample.bam \ --filter-bwa-image hg19mini.fasta.img \ --kmer-file hg19mini.hss \ --min-clipped-read-length 70 \ --microbe-fasta e_coli_k12.fasta \ --microbe-bwa-image e_coli_k12.fasta.img \ --taxonomy-file e_coli_k12.db \ --output output.pathseq.bam \ --scores-output output.pathseq.txt
This ran in 2 minutes on a Macbook Pro with a 2.8GHz Quad-core CPU and 16 GB of RAM. If running on a local workstation, users can monitor the progress of the pipeline through a web browser at http://localhost:4040.
The PathSeq output files are:
|2||... cellular_organisms Bacteria||superkingdom||Bacteria||Bacteria||189580||100||189580||189580||0|
|1236||... Proteobacteria Gammaproteobacteria||class||Gammaproteobacteria||Bacteria||189580||100||189580||189580||0|
|91347||... Gammaproteobacteria Enterobacterales||order||Enterobacterales||Bacteria||189580||100||189580||189580||0|
|543||... Enterobacterales Enterobacteriaceae||family||Enterobacteriaceae||Bacteria||189580||100||189580||189580||0|
|561||... Enterobacteriaceae Escherichia||genus||Escherichia||Bacteria||189580||100||189580||189580||0|
|562||... Escherichia Escherichia_coli||species||Escherichia_coli||Bacteria||189580||100||189580||189580||0|
|83333||... Escherichia_coli Escherichia_coli_K-12||no_rank||Escherichia_coli_K-12||Bacteria||189580||100||189580||189580||0|
Each line provides information for a single node in the taxonomic tree. A "root" node corresponding to the top of the tree is always listed. Columns to the right of the taxonomic information are:
In this example, one can see that PathSeq detected 189,580 reads reads that mapped to the strain reference for E. coli K-12 MG1655. This read count is propogated up the tree (species, genus, family, etc.) to the root node. If other species were present, their read counts would be listed and added to their corresponding ancestral taxonomic classes.
PathSeq can also be used to discover novel microorganisms by analyzing the unmapped reads, e.g. using BLAST or de novo assembly. To get the number of non-host (microbe plus unmapped) reads use the
samtools view command:
> samtools view –c output.pathseq.bam 189580
Since the reported number of E. coli reads is the same number of reads in the output BAM, there are 0 reads of unknown origin in this sample.
Custom host and microbe references must both be prepared for use with PathSeq. The references should be supplied as FASTA files with proper indices and sequence dictionaries. The host reference is used to build a BWA-MEM index image and a k-mer file. The microbe reference is used to build another BWA-MEM index image and a taxonomy file. Here we assume you are starting with the FASTA reference files that have been properly indexed:
The BWA index images must be build using BwaMemIndexImageCreator:
> gatk BwaMemIndexImageCreator -I host.fasta > gatk BwaMemIndexImageCreator -I microbe.fasta
The PathSeqBuildKmers tool creates a library of k-mers from a host reference FASTA file. Create a hash set of all k-mers in the host reference with following command:
> gatk PathSeqBuildKmers \ --reference host.fasta \ -O host.hss
Download the latest RefSeq accession catalog RefSeq-releaseXX.catalog.gz, where XX is the latest RefSeq release number, at: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/ Download NCBI taxonomy data files dump (no need to extract the archive): ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz Assuming these files are now in your current working directory, build the taxonomy file using PathSeqBuildReferenceTaxonomy:
> gatk PathSeqBuildReferenceTaxonomy \ -R microbe.fasta \ --refseq-catalog RefSeq-releaseXX.catalog.gz \ --tax-dump taxdump.tar.gz \ -O microbe.db
The preceding instructions can be conveniently executed with the following bash script:
#!/bin/bash set -eu GATK_HOME=/path/to/gatk REFSEQ_CATALOG=/path/to/RefSeq-releaseXX.catalog.gz TAXDUMP=/path/to/taxdump.tar.gz echo "Building pathogen reference..." $GATK_HOME/gatk BwaMemIndexImageCreator -I microbe.fasta $GATK_HOME/gatk PathSeqBuildReferenceTaxonomy -R microbe.fasta --refseq-catalog $REFSEQ_CATALOG --tax-dump $TAXDUMP -O microbe.db echo "Building host reference..." $GATK_HOME/gatk BwaMemIndexImageCreator -I host.fasta $GATK_HOME/gatk PathSeqBuildKmers --reference host.fasta -O host.hss
Increase the Java heap limit. For example, to increase the limit to 4GB with the
> gatk --java-options "-Xmx4G" ...
This should generally be set to a value greater than the sum of all reference files.
The input reads must pass an initial validity filter, WellFormedReadFilter. A common cause of empty output is that the input reads do not pass this filter, often because none of the reads have been assigned to a read group (with an
RG tag). For instructions on adding read groups, see this article, but note that PathSeqPipelineSpark and PathSeqFilterSpark do not require the input BAM to be sorted or indexed.