PrintReads merges or subsets sequence data. The tool automatically applies MalformedReadFilter and BadCigarFilter to filter out certain types of reads that cause problems for downstream GATK tools, e.g. reads with mismatching numbers of bases and base qualities or reads with CIGAR strings containing the N operator.
Subsetting reads corresponding to a genomic interval using PrintReads requires reads that are aligned to a reference genome, coordinate-sorted and indexed. Place the .bai index in the same directory as the BAM file.
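A quick pre-flight check can confirm that an index sits next to the input before running the tool. The following is a minimal shell sketch, not part of GATK: the function name `has_bam_index` is hypothetical, and it assumes the index is named either `<name>.bam.bai` or `<name>.bai`.

```shell
# Hypothetical helper: succeeds if a .bai index sits next to the given BAM.
# Assumption: the index is named <name>.bam.bai or <name>.bai.
has_bam_index() {
    [ -f "$1.bai" ] || [ -f "${1%.bam}.bai" ]
}

# Example: warn before running PrintReads with -L
if ! has_bam_index 6517_2Mbp_input.bam; then
    echo "No .bai index found next to 6517_2Mbp_input.bam" >&2
fi
```

The check only looks for the file; it does not verify that the index is current or that the BAM is coordinate-sorted.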
# -R: reference fasta
# -L: desired genomic interval chr:start-end
# -I: input
# -o: output
java -Xmx8G -jar /path/GenomeAnalysisTK.jar \
    -T PrintReads \
    -R /path/human_g1k_v37_decoy.fasta \
    -L 10:91000000-92000000 \
    -I 6517_2Mbp_input.bam \
    -o 6517_1Mbp_output.bam
This creates a subset of reads from the input file, 6517_2Mbp_input.bam, that align to the interval defined by the -L option, here a 1 Mbp region on chromosome 10. The tool creates two new files, 6517_1Mbp_output.bam and its corresponding .bai index.
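To check the size of a region given in chr:start-end form, the arithmetic can be sketched in shell. The helper name `interval_length` is hypothetical, and the sketch assumes the 1-based, inclusive coordinates that GATK interval strings use, so the length is end - start + 1.

```shell
# Hypothetical helper: length in bp of an interval written chr:start-end.
# Assumes 1-based, inclusive coordinates, so length = end - start + 1.
interval_length() {
    range="${1##*:}"       # drop everything up to the last ":"
    start="${range%-*}"    # text before the "-"
    end="${range#*-}"      # text after the "-"
    echo $(( end - start + 1 ))
}

interval_length 10:91000000-92000000   # prints 1000001
```

Under that assumption the example interval covers 1,000,001 positions, which matches the roughly 1 Mbp region described above.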
To process large files, also designate a temporary directory.
TMP_DIR=/path/shlee #sets environment variable for temporary directory