When viewing a DISCOVAR de novo assembly of multisample data with NhoodInfo, one may now flag edges having any specified pattern of presence or absence of reads from given samples. For example, in a three-sample assembly of child, mother, and father, the command PURPLE=100 will cause edges having only reads from the child to be flagged as purple. This change takes effect as of revision 52401.
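As a rough illustration of what such a pattern means, the Python sketch below tests whether one edge matches a presence/absence pattern. It assumes that each pattern character corresponds to one sample in order, with 1 meaning "reads present" and 0 meaning "no reads", which is consistent with the child/mother/father example above. The function name and the per-edge read counts are hypothetical, and this is not NhoodInfo's actual implementation.

```python
def matches_pattern(read_counts, pattern):
    """Return True if the presence/absence of reads per sample matches pattern.

    read_counts: number of reads per sample on one edge, in sample order
                 (hypothetical input, not a NhoodInfo data structure).
    pattern:     string of '1' (reads present) and '0' (no reads),
                 one character per sample (assumed interpretation).
    """
    if len(read_counts) != len(pattern):
        raise ValueError("pattern length must equal the number of samples")
    return all((count > 0) == (flag == "1")
               for count, flag in zip(read_counts, pattern))


# Samples in order: child, mother, father.
print(matches_pattern([7, 0, 0], "100"))  # True: reads from the child only
print(matches_pattern([7, 2, 0], "100"))  # False: the mother also has reads
```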
Author Archives: David Jaffe
Input spec documentation
We’re adding a detailed description of the input specification for DiscovarDeNovo. The specification is quite general and allows several file types as well as path descriptions. Please let us know if anything is unclear or could use improvement. The new doc will appear with release 52345.
New statistic for detecting run problems
We’ve added a new statistic ‘MPL1’, the mean length of the first read in a pair up to the first error. This turns out to be diagnostic for certain run failures. The normal range is 175-225 for 250-base reads. A much smaller value would indicate that something has gone badly wrong (which is a rare event). This stat goes live as of revision 52335.
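To make the definition concrete, here is a minimal Python sketch of the averaging step. It assumes the position of the first error in each first read is already known; how DISCOVAR de novo detects errors is not described here, and the function names and input format are hypothetical.

```python
def mean_error_free_prefix(first_reads):
    """Mean length of each first read up to its first error.

    first_reads: iterable of (read_length, first_error_index or None).
    The error-free prefix is the 0-based index of the first error,
    or the whole read when no error was found (assumed convention).
    """
    lengths = [length if err is None else err for length, err in first_reads]
    if not lengths:
        raise ValueError("no reads given")
    return sum(lengths) / len(lengths)


def in_normal_range(mpl1, low=175, high=225):
    """Check against the normal range quoted above for 250-base reads."""
    return low <= mpl1 <= high


reads = [(250, None), (250, 200), (250, 150)]  # made-up example values
mpl1 = mean_error_free_prefix(reads)
print(mpl1, in_normal_range(mpl1))  # 200.0 True
```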
Better memory usage
We’ve made two changes to DISCOVAR de novo:
1. Memory usage during BAM file reading is now lower.
2. We’ve added a memory check to see how much memory appears to be available, and then throttle memory usage to that level. Sometimes (and particularly on scheduled systems like SGE) the amount of available memory is reduced. Please let us know if you observe aberrant behavior. The feature can be turned off by setting MEMORY_CHECK=False on the DiscovarExp command line.
Reference support added
We’ve added support for use of a reference sequence with DISCOVAR de novo. If you provide a reference sequence, then your assembly will be created de novo, then aligned to the reference sequence. Then when you view the assembly using NhoodInfo, the view will be marked to show reference coordinates.
To use this new feature, supply the command-line option REFHEAD=g, where you have files g.fasta and g.names. The file g.names should be a text file having the same number of lines as g.fasta has records, with each line being a name for the corresponding record (e.g. chr3). These names are displayed by NhoodInfo, so you want them to be reasonably short to avoid crowding the display.
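Before launching a run, one might want to confirm that the two files agree. The following Python sketch counts FASTA records (lines starting with ">") and lines in the names file. The helper name check_refhead is hypothetical and the script is not part of DISCOVAR de novo; the demo writes a throwaway two-record reference so that it can be run as is.

```python
import os
import tempfile


def check_refhead(refhead):
    """Compare the record count of <refhead>.fasta with the line count
    of <refhead>.names. Returns (n_records, n_names, counts_match)."""
    with open(refhead + ".fasta") as fasta:
        n_records = sum(1 for line in fasta if line.startswith(">"))
    with open(refhead + ".names") as names:
        n_names = sum(1 for _ in names)
    return n_records, n_names, n_records == n_names


# Demo on a small made-up reference with two records.
with tempfile.TemporaryDirectory() as tmp:
    head = os.path.join(tmp, "g")
    with open(head + ".fasta", "w") as f:
        f.write(">rec1\nACGT\n>rec2\nGGCC\n")
    with open(head + ".names", "w") as f:
        f.write("chr1\nchr2\n")
    demo_result = check_refhead(head)

print(demo_result)  # (2, 2, True)
```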
Fractional read support added
We’ve added support for use of part of a read set. For example,
READS="frac:0.6 :: x.bam"
will cause 60% of the read pairs in x.bam to be chosen at random and assembled. This is useful to understand the effect of lower coverage and in cases where one has too much data to assemble on a given machine. The frac option can be combined with the sample option, e.g.
READS="sample:T :: t.bam + sample:N,frac:0.5 :: n.bam"
to use all the reads in t.bam but only half of the reads in n.bam.
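For intuition about what choosing a fraction of read pairs at random involves, here is a small Python sketch that keeps each pair independently with probability frac. This is only one plausible way to subsample, so the kept fraction is approximate; the procedure DISCOVAR de novo actually uses is not described here, and the function name and seed parameter are hypothetical.

```python
import random


def subsample_pairs(pairs, frac, seed=0):
    """Keep each read pair independently with probability frac.

    pairs: any iterable of read pairs (kept or dropped as whole pairs).
    seed:  fixed seed so that the same subset is chosen on every run.
    """
    if not 0.0 <= frac <= 1.0:
        raise ValueError("frac must be between 0 and 1")
    rng = random.Random(seed)
    return [pair for pair in pairs if rng.random() < frac]


pairs = [("read%d/1" % i, "read%d/2" % i) for i in range(10000)]
kept = subsample_pairs(pairs, 0.6)
print(len(kept) / len(pairs))  # close to 0.6
```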
Memory limit added
DISCOVAR de novo now has an argument MAX_MEM_GB that can be used to limit memory usage to roughly the given amount. This can be useful on very large shared-memory systems.
Multiple sample support added
We have added support for assembly of multiple related samples to DISCOVAR de novo. For example,
READS="sample:T :: t.bam + sample:N :: n.bam"
will assemble together data from two bam files t.bam and n.bam, and in so doing keep track of their sample identities as “T” and “N”. These sample identities are carried forward and may be seen during visualization, for example to show the number of reads from each sample supporting a given edge. This feature is compatible with the FASTQ support described in the previous post. We will add full documentation later.
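When there are many samples, the READS value can be assembled by a script. The Python sketch below builds a string in the form shown in the example above. It covers only that form (one sample name per file, joined with " + "), and the helper name is hypothetical.

```python
def reads_arg(samples):
    """Build a READS value from (sample_name, path) tuples,
    following the 'sample:NAME :: FILE + ...' form shown above."""
    return " + ".join(f"sample:{name} :: {path}" for name, path in samples)


value = reads_arg([("T", "t.bam"), ("N", "n.bam")])
print(f'READS="{value}"')  # READS="sample:T :: t.bam + sample:N :: n.bam"
```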
FASTQ support added
We’ve added FASTQ support to DISCOVAR de novo. Pretty much any reasonable syntax for READS="...", including “globbable” wildcard characters, should be interpreted correctly. The interlaced and non-interlaced cases should be correctly distinguished. Allowed suffixes are .fastq, .fastq.gz, .fq and .fq.gz. We will provide detailed documentation later.
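As a quick pre-flight check, one could verify file names against the allowed suffixes before building a READS argument. This Python sketch does only that; it does not reproduce how DISCOVAR de novo itself parses READS, and the helper name is hypothetical.

```python
ALLOWED_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def has_allowed_suffix(path):
    """True if the file name ends in one of the FASTQ suffixes listed above."""
    return path.endswith(ALLOWED_SUFFIXES)


for name in ["lane1_R1.fastq.gz", "lane1_R2.fq", "lane1.txt"]:
    print(name, has_allowed_suffix(name))
```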
Memory guidelines
DISCOVAR de novo requires roughly 2 bytes of memory for each base of input data. The program now provides feedback for a given run. For example, if you have only 1 byte of memory per base, a warning will be issued. We have also fixed bugs associated with having more than 2^31 (about two billion) reads.
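The guideline translates into simple arithmetic. The Python sketch below estimates the memory suggested by the 2-bytes-per-base rule and the bytes per base available on a given machine. The run size is a made-up example, the function names are hypothetical, and the exact threshold at which the program warns is not stated here.

```python
BYTES_PER_BASE = 2  # rough guideline quoted above


def required_memory_gb(total_bases):
    """Approximate memory (in GB, 1 GB = 1e9 bytes) suggested by the guideline."""
    return BYTES_PER_BASE * total_bases / 1e9


def bytes_per_base(available_gb, total_bases):
    """Memory available per input base on a given machine."""
    return available_gb * 1e9 / total_bases


# Hypothetical run: 500 million reads of 250 bases each.
total_bases = 500_000_000 * 250
print(required_memory_gb(total_bases))   # 250.0
print(bytes_per_base(125, total_bases))  # 1.0, half of the guideline
```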