How do I run ContEst?

The tool is run using Java; the command to execute the tool looks like:

      java -jar ContEst.jar -T Contamination

The required command-line arguments for the tool:

  • -T Contamination - ContEst is based on the GATK; this is telling the GATK to run the contamination tool
  • -B:genotypes,vcf <your.genotypes.vcf> - Your genotypes file (as a VCF), taken from array data. See below for information about how to convert Birdseed output into VCF.
  • -B:pop,vcf <population_AF_vcf.vcf> - The population allele frequencies for each SNP in HapMap
  • -BTI genotypes - drive the tool by the known genotypes for this sample
  • -I <your_bam.bam> - Your input BAM, containing the reads for the sample
  • -R <your_copy_of_hg19.fasta> - the FASTA file for the appropriate genome build

Common, optional parameters include:

  • -o <your_output_file.txt> - write the output to this file
  • -pc <precision> - the precision you wish to run the tool with (for example, 0.1 indicates you'd like to estimate contamination with 0.1 precision)
  • -sn <your_sample_name> - your sample name, as known in the genotypes VCF (optional if only one sample in the genotypes VCF)
  • -llc LANE - report estimates for each read group in the BAM
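
As a rough sketch, the required and optional arguments listed above could be combined into one command as follows. Every file name and the sample name are placeholders taken from the argument descriptions, not real files, and the contest_cmd array name is only for illustration. The script prints the assembled command rather than running it, so it can be checked before use.

```shell
#!/usr/bin/env bash
# Sketch only: all paths and the sample name are placeholders.
contest_cmd=(
  java -jar ContEst.jar
  -T Contamination
  -R your_copy_of_hg19.fasta
  -I your_bam.bam
  -B:genotypes,vcf your.genotypes.vcf
  -B:pop,vcf population_AF_vcf.vcf
  -BTI genotypes
  -o your_output_file.txt
  -pc 0.1
  -sn your_sample_name
  -llc LANE
)

# Print the command for review instead of executing it.
printf '%s ' "${contest_cmd[@]}"
echo
```

Once the placeholders point at real files, the command can be executed with "${contest_cmd[@]}". Keeping the arguments in an array avoids quoting problems if any path contains spaces.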

ContEst is a Java tool based on the Genome Analysis Toolkit (GATK), and many of its inputs are processed through the GATK's engine. To get more information on how to run the tool, you can run the following command:

java -jar ContEst.jar -T Contamination -h

which produces the following output (along with general help on using the GATK):

Arguments for Contamination:
 -o,--out <out>        An output file presented to the walker.  Will overwrite 
                       contents if file exists.
 --trim_fraction <trim_fraction>     
                       what fraction of sites with highest and lowest likelihood 
                       values to trim
 -llc,--lane_level_contamination <lane_level_contamination>   
                       set to META (default), SAMPLE or LANE to produce per-bam, 
                       per-sample or per-lane estimates
 -sn,--sample_name <sample_name>                              
                       The sample name; used to extract the correct genotypes 
                       from mutli-sample truth vcfs
 -pc,--precision <precision>                                  
                       the degree of precision to which the contamination tool 
                       should estimate (e.g. the bin size)
 -vs,--verify_sample   should we veriy that the sample name is in the genotypes 
 -mbc,--minimum_base_count <minimum_base_count>               
                       what minimum number of bases do we need to see to call 
                       contamination in a lane / sample?
 -pop,--population <population>
                       evaulate contamination for just a single contamination 

Example ContEst Command

An example data package is available from the download page. You'll also need to download the 1000 Genomes B37 reference file and the associated fai file.

To run the example, you'll need to have downloaded the ContEst binary zip file and the hg19_population_stratified_af_hapmap_3.3.vcf.gz to your system. The example data is based on two 1000 Genomes samples with low contamination levels, mixed together. The command to run is:

java -Xmx2g -jar <ContEst_JAR_Location>/ContEst.jar \
-I <example_data_location>/chr20_sites.bam \
-R <reference_location>/human_g1k_v37.fasta \
-B:pop,vcf <hg19_population_stratified_af_hapmap_3.3_location>/hg19_population_stratified_af_hapmap_3.3.vcf \
-T Contamination \
-B:genotypes,vcf <example_data_location>/hg00142.vcf \
-BTI genotypes \
-o contamination_results_chr20.txt

This example will produce an output file which should look like the following:

name    population      population_fit  contamination   confidence_interval_95_width    confidence_interval_95_low      confidence_interval_95_high     sites
META    CEU     n/a     8.2     0.9     7.7     8.6     733

Here we can see that ContEst found that the file was approximately 8.2 percent contaminated, with a 95% confidence interval from 7.7 to 8.6.
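
The same reading of the output can be scripted. The sketch below assumes the results file is whitespace-delimited with the columns in the order shown in the example (contamination in column 4, the confidence interval bounds in columns 6 and 7, the site count in column 8). To stay self-contained it recreates the example output in a temporary file; in practice you would point awk at your own output file, such as contamination_results_chr20.txt.

```shell
#!/usr/bin/env bash
# Recreate the example output from this page in a temporary file.
results=$(mktemp)
printf 'name\tpopulation\tpopulation_fit\tcontamination\tconfidence_interval_95_width\tconfidence_interval_95_low\tconfidence_interval_95_high\tsites\n' > "$results"
printf 'META\tCEU\tn/a\t8.2\t0.9\t7.7\t8.6\t733\n' >> "$results"

# Skip the header line, then report the estimate and its 95% interval.
summary=$(awk 'NR > 1 {
  printf "%s: %s%% contamination (95%% CI %s-%s, %s sites)\n", $1, $4, $6, $7, $8
}' "$results")

echo "$summary"
rm -f "$results"
```

For the example data this prints "META: 8.2% contamination (95% CI 7.7-8.6, 733 sites)". When -llc is set to SAMPLE or LANE, the file would presumably contain one row per sample or lane, and the same awk command prints one summary line per row.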