Memory limit added

DISCOVAR de novo now has an argument MAX_MEM_GB that can be used to limit memory usage to roughly the given amount. This can be useful on very large shared-memory systems.

Multiple sample support added

We have added support for assembly of multiple related samples to DISCOVAR de novo. For example,

READS="sample:T :: t.bam + sample:N :: n.bam"

will assemble together data from two bam files t.bam and n.bam, and in so doing keep track of their sample identities as “T” and “N”. These sample identities are carried forward and may be seen during visualization, for example to show the number of reads from each sample supporting a given edge. This feature is compatible with the FASTQ support described in the previous post. We will add full documentation later.
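
As a rough illustration of the syntax above, a READS value can be assembled from a list of sample names and BAM paths. The helper name build_reads_arg is hypothetical and not part of DISCOVAR de novo. The sketch only reproduces the "sample:NAME :: FILE" pattern and the " + " separator shown in the example, and assumes nothing about other forms the argument may accept.

```python
def build_reads_arg(samples):
    """Build a READS value from (sample_name, path) pairs.

    Hypothetical helper: it mirrors the 'sample:T :: t.bam + sample:N :: n.bam'
    pattern from the example and nothing more.
    """
    parts = []
    for name, path in samples:
        # Rejecting empty fields is a choice made for this sketch,
        # not a documented rule of the program.
        if not name or not path:
            raise ValueError("each sample needs a non-empty name and path")
        parts.append(f"sample:{name} :: {path}")
    # Samples are joined with ' + ', as in the example.
    return " + ".join(parts)


reads_value = build_reads_arg([("T", "t.bam"), ("N", "n.bam")])
print(f'READS="{reads_value}"')
```

With the two files from the example, this prints the same READS="..." line shown above.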

FASTQ support added

We’ve added FASTQ support to DISCOVAR de novo. Pretty much any reasonable syntax for READS="..." including “globbable” wildcard characters should be interpreted correctly. Interlaced and non-interlaced files should be correctly distinguished. Allowed suffixes are .fastq, .fastq.gz, .fq and .fq.gz. We will provide detailed documentation later.
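
Before launching a run, it can help to check that your input file names end in one of the allowed suffixes. The following is a minimal sketch of such a check, not the program's own logic. The function names are hypothetical, and case-sensitive matching is an assumption.

```python
# Suffixes listed in the post above.
ALLOWED_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def has_allowed_suffix(path):
    """Return True if the file name ends in one of the allowed FASTQ suffixes.

    Assumption: matching is case-sensitive; the post does not say either way.
    """
    return path.endswith(ALLOWED_SUFFIXES)


def partition_by_suffix(paths):
    """Split paths into (accepted, rejected) lists, preserving input order."""
    accepted = [p for p in paths if has_allowed_suffix(p)]
    rejected = [p for p in paths if not has_allowed_suffix(p)]
    return accepted, rejected


accepted, rejected = partition_by_suffix(
    ["lane1.fastq.gz", "lane2.fq", "notes.txt", "lane3.fastq.bz2"]
)
print("accepted:", accepted)
print("rejected:", rejected)
```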

Memory guidelines

DISCOVAR de novo requires roughly 2 bytes of memory for each base of input data. The program now provides feedback for a given run. For example, if you have only 1 byte of memory per base, a warning will be issued. We have also fixed bugs associated with having more than 2^31 (about two billion) reads.
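
The 2-bytes-per-base guideline is easy to turn into a quick estimate before a run. The sketch below is only that arithmetic, not the check the program performs. The function names are hypothetical, and a gigabyte is assumed to be 10^9 bytes, which is well within the "roughly" of the guideline.

```python
BYTES_PER_BASE_GUIDELINE = 2   # guideline from the post above
BYTES_PER_GB = 10**9           # assumption: decimal gigabytes


def recommended_memory_gb(total_bases):
    """Memory suggested by the guideline, in GB, for a given number of input bases."""
    return BYTES_PER_BASE_GUIDELINE * total_bases / BYTES_PER_GB


def bytes_per_base(memory_gb, total_bases):
    """Bytes of memory available per input base."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return memory_gb * BYTES_PER_GB / total_bases


def meets_guideline(memory_gb, total_bases):
    """True if the available memory gives at least 2 bytes per base."""
    return bytes_per_base(memory_gb, total_bases) >= BYTES_PER_BASE_GUIDELINE


# Hypothetical data set: 100 billion bases of input.
total_bases = 100_000_000_000
print(recommended_memory_gb(total_bases))   # GB suggested by the guideline
print(meets_guideline(100, total_bases))    # is a 100 GB machine enough?
```

For 100 billion bases the guideline suggests about 200 GB. A machine with 100 GB would have only 1 byte per base, the situation in which the post says a warning is issued.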

Checking read pair chimerism

DISCOVAR de novo now reports the fraction of read pairs that appear to be chimeric. Fractions of around 1% are expected and probably due to artifacts of read mapping within the assembly. Fractions much higher than this are indicative of a serious problem, most likely a computational scrambling of the read pairs defined as input to DISCOVAR de novo.
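
If you track the reported fraction across runs, a small check like the one below can flag values worth a second look. This is a sketch under stated assumptions: the function names are hypothetical, and the cutoff of five times the expected 1% is an arbitrary stand-in for "much higher", not a threshold used by DISCOVAR de novo.

```python
EXPECTED_FRACTION = 0.01   # "around 1%", per the post above
SUSPICIOUS_FACTOR = 5      # hypothetical multiplier standing in for "much higher"


def chimeric_fraction(chimeric_pairs, total_pairs):
    """Fraction of read pairs that appear chimeric."""
    if total_pairs <= 0:
        raise ValueError("total_pairs must be positive")
    return chimeric_pairs / total_pairs


def looks_scrambled(fraction, factor=SUSPICIOUS_FACTOR):
    """True if the fraction is well above the expected level.

    The cutoff (EXPECTED_FRACTION * factor) is an assumption of this sketch.
    """
    return fraction > EXPECTED_FRACTION * factor


fraction = chimeric_fraction(10, 1000)
print(fraction, looks_scrambled(fraction))
```

A flagged run is a prompt to check that the input files still have mates correctly paired, since the post names scrambled read pairs as the most likely cause.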