When viewing a DISCOVAR de novo assembly with
NhoodInfo, from multisample data, one may now flag edges having any specified pattern of presence or absence of reads from given samples. For example, in a three-sample assembly of child, mother and father, the command-line option
PURPLE=100 will cause edges having only reads from the child to be flagged as purple. This change takes effect as of revision 52401.
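The presence/absence test behind such a flag can be pictured as follows. This is a minimal sketch, not NhoodInfo's own code: it assumes the pattern holds one character per sample in assembly order (child, mother, father in the example), with 1 meaning reads present and 0 meaning reads absent. The function name and the per-sample read counts are hypothetical.

```python
def matches_pattern(read_counts, pattern):
    """Return True if per-sample read presence agrees with a 0/1 pattern.

    read_counts: reads supporting the edge, one count per sample (hypothetical input).
    pattern: string such as "100"; assumed to mean present/absent per sample.
    """
    if len(read_counts) != len(pattern):
        raise ValueError("pattern must have one character per sample")
    if any(flag not in "01" for flag in pattern):
        raise ValueError("pattern may contain only 0 and 1")
    return all((count > 0) == (flag == "1")
               for count, flag in zip(read_counts, pattern))


# Edge supported only by reads from the child: matches "100".
child_only = matches_pattern([7, 0, 0], "100")
```

Under these assumptions an edge with reads from both child and mother would not match 100, since the pattern requires the mother's reads to be absent.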
We’re adding a detailed description of the input specification for
DiscovarDeNovo. It is quite general and allows several file types as well as path descriptions. Please let us know if anything is unclear or could use improvement. The new doc will appear with release 52345.
We’ve added a new statistic ‘MPL1’, the mean length of the first read in a pair up to the first error. This turns out to be diagnostic for certain run failures. The normal range is 175-225 for 250-base reads. A much smaller value would indicate that something has gone badly wrong (which is a rare event). This stat goes live as of revision 52335.
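As an illustration of the definition only, here is a small sketch of how a statistic of this kind could be computed. How DISCOVAR de novo locates the first error in a read is not described here, so the sketch takes the error positions as given; the function name and input layout are hypothetical.

```python
def mean_prefix_length(read_lengths, first_error_positions):
    """Mean length of each first read up to its first error.

    read_lengths: length of the first read of each pair.
    first_error_positions: zero-based index of the first error in each read,
        or None if the read is error-free (assumed input, however obtained).
    """
    if not read_lengths or len(read_lengths) != len(first_error_positions):
        raise ValueError("need one error position (or None) per read")
    # An error-free read contributes its full length.
    prefixes = [length if pos is None else pos
                for length, pos in zip(read_lengths, first_error_positions)]
    return sum(prefixes) / len(prefixes)


# Two 250-base reads: one with its first error at base 200, one error-free.
example = mean_prefix_length([250, 250], [200, None])
```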
The DISCOVAR de novo executable has been renamed
DiscovarDeNovo (previously it was called
DiscovarExp). Its operation remains unchanged. We think we’ve managed to update all the documentation, but if you spot any holdovers please let us know.
We’ve made two changes to DISCOVAR de novo:
1. Memory usage during BAM file reading is now lower.
2. We’ve added a check that estimates how much memory appears to be available and then throttles memory usage to that level. Sometimes (and particularly on scheduled systems like SGE) the memory available to a job is less than the machine’s total. Please let us know if you observe aberrant behavior. The feature can be turned off by setting
MEMORY_CHECK=False on the
DiscovarExp command line.
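To make the idea concrete, here is a rough sketch of one way a program might estimate usable memory on a Unix-like system. It is not DISCOVAR de novo's implementation. It assumes that a scheduler-imposed cap shows up as the process address-space limit (RLIMIT_AS), which may not hold on every system; the function names are hypothetical.

```python
import os
import resource


def usable_bytes(physical_bytes, soft_limit):
    """Smaller of physical memory and the per-process limit, if one is set."""
    if soft_limit == resource.RLIM_INFINITY or soft_limit < 0:
        return physical_bytes  # no limit imposed on this process
    return min(physical_bytes, soft_limit)


def detect_usable_bytes():
    """Query the OS; assumes sysconf page counts and RLIMIT_AS are available."""
    physical = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    return usable_bytes(physical, soft)
```

A caller could then size its buffers against `detect_usable_bytes()` rather than against total physical memory.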
We’ve added support for use of a reference sequence with DISCOVAR de novo. If you provide a reference sequence, then your assembly will be created de novo, then aligned to the reference sequence. Then when you view the assembly using
NhoodInfo, the view will be marked to show reference coordinates.
To use this new feature, supply the command-line option
REFHEAD=g, where you have files
g.fasta and
g.names. The file
g.names should be a text file having the same number of lines as
g.fasta has records, with each line being a name for the corresponding record (e.g. chr3). These names are displayed by
NhoodInfo, so you want them to be reasonably short to avoid crowding the display.
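Before running an assembly it may be worth checking that the two files agree. The sketch below counts FASTA records by their header lines and compares that with the number of lines in the names file. The helper names are hypothetical, and the check assumes a plain, uncompressed FASTA file.

```python
def count_fasta_records(fasta_path):
    """Count records in a FASTA file by counting header lines."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def load_names(fasta_path, names_path):
    """Return the names, or raise if their count differs from the record count."""
    with open(names_path) as handle:
        names = [line.rstrip("\n") for line in handle]
    n_records = count_fasta_records(fasta_path)
    if len(names) != n_records:
        raise ValueError(
            f"{names_path} has {len(names)} lines but "
            f"{fasta_path} has {n_records} records"
        )
    return names
```

For example, `load_names("g.fasta", "g.names")` would return the display names in record order, or fail early with a message showing the two counts.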
We’ve added support for use of part of a read set. For example,
READS="frac:0.6 :: x.bam"
will cause 60% of the read pairs in
x.bam to be chosen at random and assembled. This is useful to understand the effect of lower coverage and in cases where one has too much data to assemble on a given machine. The
frac option can be combined with the
sample option e.g.
READS="sample:T :: t.bam + sample:N,frac:0.5 :: n.bam"
to use all the reads in
t.bam but only half of the reads in
n.bam.
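When launching many runs from a script, a READS value of this shape can be composed programmatically. The helper below is a hypothetical convenience, not part of DISCOVAR de novo. It reproduces only the syntax shown in the examples above (options joined by commas, `::` before the file, `+` between inputs); emitting a bare path when no options are given is an assumption.

```python
def reads_spec(entries):
    """Build a READS value from (path, options) pairs.

    entries: list of (path, dict) where the dict maps option names such as
        "sample" or "frac" to their values.
    """
    parts = []
    for path, options in entries:
        opts = ",".join(f"{key}:{value}" for key, value in options.items())
        # Assumption: an input with no options is written as the bare path.
        parts.append(f"{opts} :: {path}" if opts else path)
    return " + ".join(parts)


spec = reads_spec([
    ("t.bam", {"sample": "T"}),
    ("n.bam", {"sample": "N", "frac": 0.5}),
])
```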
DISCOVAR de novo now has an argument MAX_MEM_GB that can be used to limit memory usage to roughly the given amount. This can be useful on very large shared-memory systems.
We’ve added some new assembly statistics to DISCOVAR de novo. These are in the file stats in
a.final and are mirrored in standard output. These along with the file
frags.dist.png are often diagnostic.
DISCOVAR de novo produces several output files, including a file of scaffolds
a.lines.fasta in which a single path through a genomic locus is shown, even when multiple paths are possible (for one of several reasons, including polymorphism). (See “Edges, lines and scaffolds”.) This ‘flattened’ representation of the assembly loses information but has the advantage that it is FASTA and so can be processed by standard tools. As of revision 51386, the path shown is the one having the highest coverage. This choice is completely arbitrary in cases of bona fide polymorphism, but is helpful where an assembly bubble arises from sequencing difficulty and it is uncertain which branch is correct. In such cases, and in cases of ‘minor alleles’ in bacterial cultures, choosing the highest-coverage branch makes sense.
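The flattening rule can be sketched in a few lines. This is a toy model rather than the DISCOVAR de novo data structures: it assumes a line is a list of loci, each locus being a list of alternative branches given as (sequence, coverage) pairs, with an unambiguous stretch represented as a locus with a single branch.

```python
def flatten(line):
    """Concatenate the highest-coverage branch at each locus of a line.

    line: list of loci; each locus is a list of (sequence, coverage) tuples.
    Ties go to the first branch listed, which is an arbitrary choice.
    """
    return "".join(
        max(branches, key=lambda branch: branch[1])[0]
        for branches in line
    )


# A bubble with two single-base branches between two unambiguous stretches.
toy_line = [
    [("ACG", 30.0)],
    [("T", 12.0), ("C", 18.0)],
    [("GGA", 29.0)],
]
flattened = flatten(toy_line)
```

In this toy line the C branch is kept because its coverage (18.0) exceeds that of the T branch (12.0); the T branch is what the flattened FASTA loses.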