GenomeSTRiP Documentation | Created 2012-09-12 | Last updated 2013-01-10

Comments (0)

1. Introduction

The ComputeGenomeMask utility determines the alignability of each base in the reference genome.

Mask files are generated based on a fixed read length L. A base is considered alignable if a window of length L centered on the base is unique within the reference sequence.

The ComputeGenomeMask utility works by dividing the input genome into sequences of length L for every base position and then aligning these sequences against the reference genome (using BWA) and looking for unique matches.

The implementation of genome masking in Genome STRiP is likely to change in the future.

2. Inputs / Arguments

  • -R <fasta-file> : Reference sequence. : An indexed fasta file containing the reference sequence to mask. : The fasta file must have been indexed using BWA in preparation for BWA alignment. The fasta file must also be indexed with 'samtools faidx' or the equivalent.

  • -readLength <length> : The size of the window to use for determining alignability.

  • -sequence <sequenceName> : The name of the sequence to process (default is to process the entire genome). This can be used to parallelize the computation by chromosome.

3. Outputs

  • -O <mask-file> : Output genome mask file (fasta format). Default is to write to stdout. The mask file contains '0' at alignable positions and '1' at non-alignable positions.

4. Running

ComputeGenomeMask is part of Genome STRiP, which is a third-party GATK library.

An example invocation is shown below:


java -Xmx2g -cp SVToolkit.jar:GenomeAnalysisTK.jar \ \ 
    -R Homo_sapiens_assembly18.fasta \ 
    -O Homo_sapiens_assembly18.mask.chr1.36.fasta \ 
    -readLength 36 \ 
    -sequence chr1 

To generate a mask for the entire genome, it is generally preferably to compute the mask for each chromosome separately and then concatenate the output files in the correct order.

The following script provides an example of generating the appropiate indexes and running ComputeGenomeMask in parallel on each chromosome. This is an example and not a general purpose script; it will likely need to be modified for your environment. After each chromosome has been run, the resulting files must be concatenated together in the same order as the reference sequence to create the final mask.


outdir=example readLength=75 reference=Homo_sapiens_hg19.fasta
export SV_DIR=/humgen/cnp04/bobh/svtoolkit/stable

#These executables must be on your path.
which java > /dev/null || exit 1
which bwa > /dev/null || exit 1
#The directory containing must be on your LD_LIBRARY_PATH

mkdir -p ${outdir}/work

localReference=${outdir}/work/`echo ${reference} | awk -F / '{ print $NF }'`
if [ ! -e ${localReference} ]; then ln ${reference} ${localReference} || exit 1 fi

java -cp ${classpath} -Xmx4g \ \ 
    -I ${localReference} \ 
    -O ${localReference}.fai \
    || exit 1

bwa index -a bwtsw ${localReference} || exit 1

chroms=`cat ${localReference}.fai | cut -f 1`
for chr in ${chroms}; do
    bsub -o ${outdir}/work/svmask_${chr}.log \ 
    -R "rusage[mem=5000]" \ 
    java -cp ${classpath} -Xmx4g \ \ 
        -R ${localReference} \ 
        -O ${outdir}/work/svmask_${chr}.fasta \ 
        -readLength ${readLength} \ 
        -sequence ${chr} \
        || exit 1

Return to top Comment on this article in the forum