We have added support for assembly of multiple related samples to DISCOVAR de novo. For example,
READS="sample:T :: t.bam + sample:N :: n.bam"
will assemble together data from two BAM files, t.bam and n.bam, and in so doing keep track of their sample identities as “T” and “N”. These sample identities are carried forward and may be seen during visualization, for example to show the number of reads from each sample supporting a given edge. This feature is compatible with the FASTQ support described in the previous post. We will add full documentation later.
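As a rough illustration of the syntax above, the following Python sketch builds a READS value from a list of sample names and files. The helper name build_reads_arg is hypothetical and not part of DISCOVAR de novo; it only reproduces the "sample:NAME :: FILE + ..." pattern shown in the example and assumes nothing else about the accepted grammar.

```python
def build_reads_arg(samples):
    """Join (sample name, file) pairs into a READS value.

    Hypothetical helper: it follows only the pattern shown in the post,
    sample:NAME :: FILE + sample:NAME :: FILE
    """
    parts = [f"sample:{name} :: {path}" for name, path in samples]
    return " + ".join(parts)


reads = build_reads_arg([("T", "t.bam"), ("N", "n.bam")])
# Wrap the value in quotes the same way the example in the post does.
print(f'READS="{reads}"')
```

Generating the string from a list keeps each sample label next to its file, which may help when more than two samples are assembled together.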
FASTQ support added
We’ve added FASTQ support to DISCOVAR de novo. Pretty much any reasonable syntax for READS="...", including “globbable” wildcard characters, should be interpreted correctly. The interlaced and non-interlaced cases should be correctly distinguished. Allowed suffixes are .fastq, .fastq.gz, .fq and .fq.gz.
We will provide detailed documentation later.
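Before launching a run, it can be handy to check that input file names carry one of the allowed suffixes. The sketch below is a pre-flight check written for illustration; the function name is hypothetical, and the case-sensitive comparison is an assumption, since the post does not say how DISCOVAR de novo treats upper-case suffixes.

```python
# Suffixes listed in the post as allowed for FASTQ input.
FASTQ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def has_allowed_fastq_suffix(path):
    """Return True if the file name ends with one of the listed suffixes.

    Assumption: the comparison is case-sensitive.
    """
    return path.endswith(FASTQ_SUFFIXES)


candidates = ["lane1.fastq.gz", "lane2.fq", "notes.txt"]
accepted = [p for p in candidates if has_allowed_fastq_suffix(p)]
print(accepted)
```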
Memory guidelines
DISCOVAR de novo requires roughly 2 bytes of memory for each base of input data. The program now provides feedback for a given run. For example, if you have only 1 byte of memory per base, a warning will be issued. We have also fixed bugs associated with having more than 2^31 (about two billion) reads.
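The guideline above is simple arithmetic, and the Python sketch below works it through. The function names and the example data set (800 million reads of 250 bases) are made up for illustration; only the figure of roughly 2 bytes per base comes from the post, and the exact threshold at which the program warns is not assumed here.

```python
# Guideline from the post: roughly 2 bytes of memory per input base.
GUIDELINE_BYTES_PER_BASE = 2


def total_bases(num_reads, read_length):
    """Total input bases, assuming all reads have the same length."""
    return num_reads * read_length


def recommended_memory_bytes(bases):
    """Memory suggested by the 2-bytes-per-base guideline."""
    return bases * GUIDELINE_BYTES_PER_BASE


def bytes_per_base(available_bytes, bases):
    """Memory actually available per input base."""
    return available_bytes / bases


# Hypothetical data set: 800 million reads of 250 bases each.
bases = total_bases(800_000_000, 250)
print(recommended_memory_bytes(bases) / 1e9, "GB suggested")

# A machine with 200 GB would offer only 1 byte per base for this input.
print(bytes_per_base(200_000_000_000, bases), "bytes per base available")
```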
Checking read pair chimerism
DISCOVAR de novo now reports the fraction of read pairs that appear to be chimeric. Fractions of around 1% are expected and probably due to artifacts of read mapping within the assembly. Fractions much higher than this are indicative of a serious problem, most likely a computational scrambling of the read pairs defined as input to DISCOVAR de novo.
Remove cross-contamination for parallel samples
Sometimes parallel processing of samples can result in low-level cross-contamination, and sometimes there can be enough to assemble, especially when the samples are sequenced at high coverage. We’ve added a program, CrossOut, that can remove most of this contamination from parallel DISCOVAR de novo assemblies by looking for improbable molarity differences. It has a single argument, DIR, the parent directory for the assemblies, and creates new assembly directories a.clean within each.
Fragment library size distribution plots
Each DISCOVAR de novo assembly will now come with a plot, in the file frags.dist.png, showing the observed size distribution for the fragments defined by the input read pairs. These plots can be highly diagnostic. They are available from revision 51298 onwards. The raw data are in the file frags.dist.
Thread control for DISCOVAR de novo
You can now limit the maximum number of threads DISCOVAR de novo uses with the new option NUM_THREADS (release 51183). This is useful if you have to share your hardware, or if your system admin has limited the number of threads a single process can use. It can also be a good idea to restrict the number of threads if your hardware has many cores (>50), as the parallelization efficiency can start to drop due to locking and cache coherency issues.
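One way to pick a value is to cap the thread count at whatever the machine offers, as in the Python sketch below. The helper name is hypothetical, and the NUM_THREADS=n form is an assumption based on the NAME=value style of the READS option; the post itself only names the option.

```python
import os


def num_threads_arg(cap, available=None):
    """Build a NUM_THREADS argument no larger than the available cores.

    Assumption: the option is passed as NUM_THREADS=<n>.
    """
    if cap < 1:
        raise ValueError("cap must be at least 1")
    if available is None:
        # os.cpu_count() can return None, so fall back to a single thread.
        available = os.cpu_count() or 1
    return f"NUM_THREADS={min(cap, available)}"


# On a shared 64-core machine, keep the run to 16 threads.
print(num_threads_arg(16, available=64))
```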
New release
We have fixed the bug in DISCOVAR de novo cited in the last blog message. Please download and use the new version (50964) from our ftp site.
Temporarily rolling back to revision 50693
We found a serious bug in DISCOVAR de novo, revision 50862, resulting in data corruption in some cases. We will correct this bug as soon as possible. In the meantime we are rolling back to revision 50693. And needless to say, we are beefing up our release testing.
Native support added for BAMs in DISCOVAR de novo
The latest release (50893) of DISCOVAR de novo now supports BAM files directly, and no longer requires SAMtools to be installed. This change has the added benefit of halving the time required to import data from a BAM, potentially saving hours on a human-sized genome. Note that the original variant calling version of DISCOVAR still requires SAMtools in order to work.