Remove cross-contamination for parallel samples

Parallel processing of samples can result in low-level cross-contamination, sometimes enough to assemble, especially when the samples are sequenced at high coverage. We’ve added a program, CrossOut, that can remove most of this contamination from parallel DISCOVAR de novo assemblies by looking for improbable molarity differences. It takes a single argument, DIR, the parent directory for the assemblies, and creates a new assembly directory a.clean within each.
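After CrossOut has run, it can be handy to confirm that every assembly under DIR received its a.clean directory. The following is a minimal sketch, not part of DISCOVAR de novo: the function name find_clean_assemblies is hypothetical, and because the exact depth at which a.clean appears is an assumption, the search is recursive rather than tied to one layout.

```python
from pathlib import Path


def find_clean_assemblies(parent, name="a.clean"):
    """Return every directory called `name` found anywhere under `parent`.

    The recursive search is an assumption made for illustration; it avoids
    hard-coding how deep the cleaned assemblies sit below the parent DIR.
    """
    parent = Path(parent)
    # rglob walks the whole tree; keep directories only, in a stable order.
    return sorted(p for p in parent.rglob(name) if p.is_dir())
```

Calling find_clean_assemblies on the same DIR that was passed to CrossOut returns a list of paths whose length can be compared against the number of samples.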

New DISCOVAR de novo stats

We’ve added some new assembly statistics to DISCOVAR de novo. These are written to the file stats and are mirrored in standard output. Together with the file frags.dist.png, they are often diagnostic.

Highest coverage paths now used in scaffolds

DISCOVAR de novo produces several output files, including a file of scaffolds, a.lines.fasta, in which a single path through a genomic locus is shown even when multiple paths are possible (for one of several reasons, including polymorphism). (See “Edges, lines and scaffolds”.) This ‘flattened’ representation of the assembly loses information but has the advantage that it is FASTA and so can be processed by standard tools. As of revision 51386, the path shown is the one having the highest coverage. This choice is completely arbitrary in cases of bona fide polymorphism, but is helpful where an assembly bubble occurs because of sequencing difficulty, making it uncertain which bubble branch is correct. In such cases, and in cases of ‘minor alleles’ in bacterial cultures, choosing the highest-coverage branch makes sense.
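The idea of flattening by coverage can be illustrated with a small sketch. This is not the DISCOVAR de novo implementation: the Branch type, the list-based locus representation, the tie-breaking rule, and the coverage values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    name: str        # hypothetical label for one branch of a bubble
    sequence: str    # bases along this branch
    coverage: float  # mean read coverage of this branch


def pick_branch(branches):
    """Return the branch with the highest coverage (ties go to the first listed)."""
    if not branches:
        raise ValueError("a bubble needs at least one branch")
    return max(branches, key=lambda b: b.coverage)


def flatten(locus):
    """Join a locus into one sequence.

    `locus` is a list whose items are either a plain string (an unambiguous
    stretch) or a list of Branch objects (a bubble with alternative paths).
    """
    parts = []
    for item in locus:
        if isinstance(item, str):
            parts.append(item)
        else:
            parts.append(pick_branch(item).sequence)
    return "".join(parts)


# Made-up example: one bubble with a well-covered and a poorly covered branch.
locus = ["ACGT", [Branch("b1", "A", 31.0), Branch("b2", "G", 4.5)], "TTGA"]
flat = flatten(locus)  # the b1 branch is kept, b2 is dropped
```

The discarded branch is exactly the information the flattened FASTA loses, which is why the choice is arbitrary for a true heterozygous site and sensible for a low-coverage artifact.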