How do I run ContEst?

The tool is run using Java. The command to execute the tool looks like:

    java -jar ContEst.jar -T Contamination

The required command-line arguments for the tool:
Common, optional parameters include:
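When ContEst is launched from a script, the base command and any optional flags can be assembled as an argument list. The sketch below is only an illustration in Python: the function name build_contest_command and its parameter names are hypothetical, the required arguments are passed through untouched as required_args because they are not spelled out here, and the flag spellings (-o, -llc, -pc, -mbc, -pop) are the ones printed by the tool's -h help listing.

```python
import shlex


def build_contest_command(jar_path, required_args=(), out=None, lane_level=None,
                          precision=None, min_base_count=None, population=None):
    """Return the ContEst command as a list of arguments (hypothetical helper)."""
    # Base invocation, as given in the text: java -jar ContEst.jar -T Contamination
    cmd = ["java", "-jar", jar_path, "-T", "Contamination"]

    # Required arguments are not listed in this section, so the caller
    # supplies them verbatim and they are appended unchanged.
    cmd.extend(required_args)

    # Optional flags; spellings taken from the -h output of the tool.
    optional = [
        ("-o", out),               # output file (overwritten if it exists)
        ("-llc", lane_level),      # META (default), SAMPLE or LANE
        ("-pc", precision),        # precision / bin size of the estimate
        ("-mbc", min_base_count),  # minimum number of bases needed to call
        ("-pop", population),      # restrict to a single population
    ]
    for flag, value in optional:
        if value is not None:
            cmd.extend([flag, str(value)])
    return cmd


# The output file name here is a made-up example value.
example = build_contest_command("ContEst.jar", out="contamination.txt",
                                lane_level="SAMPLE")
print(shlex.join(example))
```

The resulting list can be handed to subprocess.run as is, which avoids shell quoting problems with file paths.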
ContEst is a Java tool based on the Genome Analysis Toolkit (GATK), and many of its inputs are processed through the GATK's engine. To get more information on how to run the tool, run the following command:

    java -jar ContEst.jar -T Contamination -h

This produces the following output (along with general help on using the GATK):

    ...
    Arguments for Contamination:
     -o,--out <out>
            An output file presented to the walker. Will overwrite contents if file exists.
     --trim_fraction <trim_fraction>
            what fraction of sites with highest and lowest likelihood values to trim
     -llc,--lane_level_contamination <lane_level_contamination>
            set to META (default), SAMPLE or LANE to produce per-bam, per-sample or per-lane estimates
     -sn,--sample_name <sample_name>
            The sample name; used to extract the correct genotypes from mutli-sample truth vcfs
     -pc,--precision <precision>
            the degree of precision to which the contamination tool should estimate (e.g. the bin size)
     -vs,--verify_sample
            should we veriy that the sample name is in the genotypes file?
     -mbc,--minimum_base_count <minimum_base_count>
            what minimum number of bases do we need to see to call contamination in a lane / sample?
     -pop,--population <population>
            evaulate contamination for just a single contamination population
    ...

Example ContEst Command

An example data package is available from the download page. You'll also need to download the 1000 Genomes B37 reference file and the associated fai file:

To run the example, you'll need to have downloaded the ContEst binary zip file and hg19_population_stratified_af_hapmap_3.3.vcf.gz to your system. The example data is based on two 1000 Genomes samples with low contamination levels, mixed together.
The command to run is:

    java -Xmx2g -jar <ContEst_JAR_Location>/ContEst.jar \

This example will produce an output file which should look like the following:

    name population population_fit contamination confidence_interval_95_width confidence_interval_95_low confidence_interval_95_high sites

Here we can see that ContEst found that the file was approximately 8.2 percent contaminated, with a 95% confidence interval from 7.7 to 8.6 percent.
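To pull the estimate out of the output file in a script, something like the following Python sketch could be used. It assumes the file is tab-delimited with the header row shown above. The helper names read_contest_results and summarize are hypothetical, and the sample row is made up for illustration: only its contamination and interval values echo the figures quoted in the text.

```python
import csv
import io

# Columns converted to float; population_fit is left as a string because
# its format is not described in this section.
FLOAT_COLUMNS = (
    "contamination",
    "confidence_interval_95_width",
    "confidence_interval_95_low",
    "confidence_interval_95_high",
)


def read_contest_results(handle):
    """Parse ContEst output into a list of dicts (assumes tab-delimited text)."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in FLOAT_COLUMNS:
            row[key] = float(row[key])
        row["sites"] = int(row["sites"])
        rows.append(row)
    return rows


def summarize(row):
    """Format one result row as a one-line, human-readable summary."""
    return "{}: {:.1f}% contaminated (95% CI {:.1f} to {:.1f})".format(
        row["name"],
        row["contamination"],
        row["confidence_interval_95_low"],
        row["confidence_interval_95_high"],
    )


# Illustrative input only: the name, population, population_fit and sites
# values below are invented placeholders, not real ContEst output.
SAMPLE = (
    "name\tpopulation\tpopulation_fit\tcontamination\t"
    "confidence_interval_95_width\tconfidence_interval_95_low\t"
    "confidence_interval_95_high\tsites\n"
    "META\tCEU\tn/a\t8.2\t0.9\t7.7\t8.6\t1000\n"
)

results = read_contest_results(io.StringIO(SAMPLE))
for result in results:
    print(summarize(result))
```

For a real run, the StringIO object would be replaced by an open file handle on the path given to -o.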