What is the difference between DISCOVAR and DISCOVAR de novo?

DISCOVAR is a variant caller and small genome assembler. The heart of DISCOVAR is a de novo genome assembler, one that is accurate enough to produce assemblies that can be used for variant calling given a reference sequence. DISCOVAR can also generate de novo assemblies for small genomes, but consider using DISCOVAR de novo instead, which can assemble genomes up to mammalian size.

DISCOVAR de novo is a large (and small) de novo genome assembler. It quickly generates highly accurate and complete assemblies using the same single library data as used by DISCOVAR. It currently doesn’t support variant calling – for that, please use DISCOVAR instead.

What are the inputs required to run DISCOVAR and DISCOVAR de novo?

DISCOVAR and DISCOVAR de novo have specific requirements for input data.

DISCOVAR and DISCOVAR de novo require a single Illumina fragment (paired end) library. We strongly recommend using a PCR-free protocol. From the library, 250 base paired reads can be created using either Illumina MiSeq or HiSeq2500 genome sequencers. The recommended coverage is about 60x. Somewhat higher or lower coverage is fine. Longer Illumina reads also work.
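As a rough planning aid, the coverage recommendation above can be turned into a read-pair count. The sketch below assumes the usual definition of coverage (total sequenced bases divided by genome size); the function names and the 3 Gb genome size in the usage lines are illustrative assumptions, not part of DISCOVAR.

```python
import math


def read_pairs_for_coverage(genome_size, coverage=60.0, read_length=250):
    """Number of read pairs needed to reach the target coverage.

    Assumes coverage = total sequenced bases / genome size, and that
    each pair contributes two reads of `read_length` bases.
    """
    bases_needed = genome_size * coverage
    bases_per_pair = 2 * read_length
    return math.ceil(bases_needed / bases_per_pair)


def coverage_from_pairs(num_pairs, genome_size, read_length=250):
    """Coverage obtained from a given number of read pairs."""
    return num_pairs * 2 * read_length / genome_size


# Illustrative only: a genome of roughly 3 Gb at the recommended 60x.
pairs = read_pairs_for_coverage(3_000_000_000)
print(pairs, coverage_from_pairs(pairs, 3_000_000_000))
```

Because somewhat higher or lower coverage is fine, the result is best treated as a target rather than a strict requirement.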

For variant calling, you must also supply a reference sequence for your genome in FASTA format.

Can you tell me more about the PCR-free library?

For a human genome, this can be made from 0.5 ug DNA.  Please see the protocol, which typically yields fragments of size ~450 bp. This is achieved by size selection using SPRI beads, rather than a gel. This method yields a wide size distribution, including some longer fragments, which is advantageous.

Should I add jumping library data?

Data from jumping libraries* would improve DISCOVAR and DISCOVAR de novo assemblies; however, we do not yet support their use.
*sometimes referred to as mate pair libraries.

Can I use other types of Illumina data?

Possibly, although the results will likely not be as good as those obtained from the recommended DISCOVAR data. Here are some tips:

  • Reads longer than 250 bases can be used.
  • PCR-amplified libraries can in principle be used, but there will be degradation in the quality of assemblies and variant calls. Use of PCR-amplified libraries is not supported. (However, some users have reported excellent results using PCR-amplified libraries, so we plan to add support as soon as we have a chance.)
  • Reads as short as 150 bases may work with DISCOVAR and DISCOVAR de novo, depending on the fragment size and other factors. However, algorithmic changes are likely needed to optimally exploit such data.
  • Short reads made from long fragments cannot be used. DISCOVAR closes fragments by extending into the gap with other reads, which then must overlap – see the diagram below. For this to work, the fragment length must be substantially smaller than four times the read length. For example, 100 base reads from ~400 bp fragments will not work.
       ---------->            <----------     original read pair
               ----------                     extending read
                       ----------             extending read
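The fragment-length rule in the last tip can be checked with a little arithmetic. The sketch below only encodes the stated limit (fragment length compared with four times the read length). How much smaller counts as "substantially smaller" is not quantified here, so the function reports the ratio and leaves that judgment to the reader. The function names are hypothetical.

```python
def fragment_to_limit_ratio(read_length, fragment_length):
    """Fragment length as a fraction of the 4 x read-length limit.

    Values well below 1.0 are what the tip above calls
    'substantially smaller'; no specific cutoff is given in the FAQ.
    """
    return fragment_length / (4 * read_length)


def exceeds_limit(read_length, fragment_length):
    """True if the fragment is at or beyond four times the read length."""
    return fragment_length >= 4 * read_length


# Recommended data: 250 base reads, ~450 bp fragments.
print(fragment_to_limit_ratio(250, 450), exceeds_limit(250, 450))
# The failing example from the tip: 100 base reads, ~400 bp fragments.
print(fragment_to_limit_ratio(100, 400), exceeds_limit(100, 400))
```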

Can I use reads from another sequencing technology?

No. However, we intend to support promising new technologies. See our roadmap.

Can DISCOVAR carry out a de novo assembly of a human-sized genome?

No, but DISCOVAR de novo can. However, DISCOVAR can call variants on human genomes.

Can DISCOVAR carry out de novo assemblies of microbial-sized genomes?

Yes, as can DISCOVAR de novo.

Can DISCOVAR de novo call variants?

Not currently, but DISCOVAR can.

Can I call variants on a large genome (e.g. human)?

Absolutely! Although DISCOVAR can currently only assemble small genomes, it is possible to instead assemble smaller portions of a larger genome. You simply specify the region of the genome you are interested in, and DISCOVAR will do the rest. This requires that you first align your reads to a reference, and provide DISCOVAR with a resulting BAM file. The alignments are used to localize reads to the region of interest, and are not used in the assembly process. Our goal is to add variant calling to DISCOVAR de novo, allowing the discovery of completely novel sequence as well as variants.

What size of region should I use when calling variants with DISCOVAR?

We recommend using small regions, for example ~100 kb.  It is possible to run on larger regions, but DISCOVAR will not at present scale to the entire human genome, and it is often easier to interpret the assemblies of smaller regions.
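One way to work through a larger interval is to tile it into windows of roughly 100 kb and run each window separately. The sketch below is a generic helper, not part of DISCOVAR: the function name, the half-open coordinate convention, and the `chr1` label are assumptions, and the exact syntax for passing a region to DISCOVAR should be taken from the manual.

```python
def tile_regions(chrom, start, end, window=100_000):
    """Yield (chrom, start, stop) windows covering [start, end).

    Windows are half-open and non-overlapping; the last window may be
    shorter than `window`.
    """
    pos = start
    while pos < end:
        stop = min(pos + window, end)
        yield (chrom, pos, stop)
        pos = stop


# Illustrative only: split a 250 kb interval into ~100 kb pieces.
for region in tile_regions("chr1", 0, 250_000):
    print(region)
```

The windows here do not overlap; whether adjacent regions should overlap is not addressed in this FAQ.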

How do DISCOVAR and DISCOVAR de novo represent genome assemblies?

DISCOVAR and DISCOVAR de novo genome assemblies are graphs, with edges representing sequence. Each edge is given as a record in a FASTA file, with graph connectivity information recorded in the header (>…) lines.

DISCOVAR generates a graphical representation of the assembly using the dot file format. For more information, please see the manual.

DISCOVAR de novo extends this by decomposing the graph into linear structures called lines. For more information, see the DISCOVAR de novo manual and the primer.
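Since each edge is a FASTA record, the edges can be loaded with a few lines of generic FASTA parsing. The sketch below keeps each header line verbatim and does not attempt to interpret the connectivity information, because the header layout is defined in the manual rather than here. The function name and the sample headers are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) for each record in an iterable of lines.

    The header is returned without the leading '>' and is otherwise
    left untouched, so any graph information in it is preserved.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records for illustration; real headers follow the manual.
sample = [">edge0 some info\n", "ACGT\n", "AC\n", ">edge1\n", "GG\n"]
for header, sequence in read_fasta(sample):
    print(header, len(sequence))
```

With a real assembly, the same function can be applied to an open file handle, for example `read_fasta(open(path))`, where `path` points at the assembly FASTA file.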

How can one view the assembly graph?

DISCOVAR generates dot files which may be viewed with Graphviz.

DISCOVAR de novo assemblies are generally too large to be viewed in their entirety. Instead, we have created a visualization tool called NhoodInfo that allows you to interactively explore the assembly graph. NhoodInfo is part of the DISCOVAR de novo package, and there is also an online demo. For more information, see the NhoodInfo manual.

How does DISCOVAR represent variants?

Variants are listed in a human-readable plain text file, in a transitional format that is specific to DISCOVAR. This format will be expanded in the near future. We are also working to translate the format to VCF; however, enhancements to VCF will be needed to accommodate complex variation features.

Can I use DISCOVAR to call variants in a population?

No. DISCOVAR is designed to work with single samples, not populations. Try using a tool like GATK instead.

In the Illumina pipeline, should I turn off the EAMSS filtering that generates bases having quality score 2?

We recommend leaving it turned on, but we are continuing to investigate its impact.

I wish to use DISCOVAR and DISCOVAR de novo in my research. Do I need a commercial license?

No. DISCOVAR and DISCOVAR de novo are released under the terms of this license, but we do encourage you to register with us. To register, simply email us, stating your name, organization name and details.

My company would like to use DISCOVAR and DISCOVAR de novo. Do I need to purchase a license?

No. DISCOVAR and DISCOVAR de novo are released under the terms of this license, but we do encourage you to register with us. To register, simply email us, stating your name, organization name and details.

How do DISCOVAR and DISCOVAR de novo relate to ALLPATHS-LG?

Currently the application spaces of ALLPATHS-LG and DISCOVAR are mostly complementary. Notably, ALLPATHS-LG can be used to assemble 100 base Illumina reads, and it has capabilities not yet available in DISCOVAR, including the ability to work with multiple libraries. However, DISCOVAR de novo offers the ability to create higher quality assemblies at considerably lower cost than using ALLPATHS-LG, given the appropriate data.