How do I run HAPSEG?
To run HAPSEG, you'll need R version 2.12.0 or higher with the following packages installed:
These packages can be installed from the R command line with the commands:so source("http://www.bioconductor.org/biocLite.R") biocLite(c("Rcpp", "numDeriv", "foreach", "geneplotter")) You will also need to install the HAPSEG package supplied in the release bundle. The subdirectory R_packages contains both source and Windows binary versions. Instructions on how to install these packages can be found at http://cran.r-project.org/doc/manuals/R-admin.html#Installing-packages. Some portions of HAPSEG can be parallelized via the use of the foreach package. This can be done before running by registering a foreach parallel backend. Please see the foreach documentation on how to do this if you would like this functionality. The HAPSEG package also requires a Java 6 runtime environment, with the command java available in the user's PATH. You will also need the following files to process your sample:
To run HAPSEG with all optional arguments using defaults, the following call is made:
RunHapSeg(plate.name, array.name, seg.fn, snp.fn, genome.build, results.dir, platform, use.pop, impute.gt, plot.segfit, merge.small, merge.close, min.seg.size, normal, out.p, seg.merge.thresh, use.normal, adj.atten, phased.bgl.dir)
An explanation of these arguments follows:
There are also several optional arguments. To use this, specify them by name e.g. verbose=TRUE.
Input File OptionsAs mentioned above, some input files take an optional argument to tell HAPSEG how to parse them. For the snp.fn argument the default situation assumes a tab delimited file with two columns named A and B with the rownames corresponding to the chip's probeset IDs. This file can either be a text file or a saved RData file containing that data as the object dat. The FullSnpFileParser parser can be used if the user is inputting a .snp file created by SNPFileCreator. Lastly the AptSnpFileParser parser is provided for those users using the Affymetrix Power Tools. This mechanism allows HAPSEG to support any input file that can be made to conform to the default structure, users can write & pass in a function to handle any arbitrary situation. For the clusters.fn file the default parser also assumes a tab delimited file with rownames being the probeset IDs. In this case there are 6 columns: AA.a, AB.a, BB.a, BB.b, AB.b and AA.b. This file structure is the same as what the SNPPipeline outputs as birdclusters.RData. An alternate parser is supplied, BirdseedClustersFileParser which assumes that the input is a raw clusters file output by Birdseed. As with snp.file.parser, any arbitrary format can be supported via the use of user supplied functions. Using Affy Power ToolsFor users outside of the Broad, Birdseed is available via the Affymetrix power tools with documentation available here. To process APT data properly you will need HAPSEG version 1.1. To get APT data working with HAPSEG, you'll need to run both apt-probeset-summarize and apt-probeset-genotype. As an example, assume that the APT binaries are installed in ./bin, the library files are in ./LibFiles and your CEL files are in ./CEL. First run the following (You might want to tweak the arguments to --analysis). apt-probeset-summarize
./bin/apt-probeset-summarize --cdf-file LibFiles/GenomeWideSNP_6.cdf --analysis quant-norm.sketch=50000,pm-only,med-polish,expr.genotype=true --out-dir aptSnpOutput ./CEL/*.CEL The important output file for this command will be aptSnpOutput/quant-norm.pm-only.med-polish.expr.summary.txt (note that this filename will differ if you changed the --analysis argument). This file will be your snp.fn argument to HAPSEG. Next you will need to run birdseed: ./bin/apt-probeset-genotype -o birdseed -c ./LibFiles/GenomeWideSNP_6.cdf --set-gender-method cn-probe-chrXY-ratio --chrX-probes ./LibFiles/GenomeWideSNP_6.chrXprobes --chrY-probes ./LibFiles/GenomeWideSNP_6.chrYprobes --special-snps ./LibFiles/GenomeWideSNP_6.specialSNPs --read-models-birdseed ./LibFiles/GenomeWideSNP_6.birdseed.models -a birdseed-v1 --write-models ./*.CEL This will provide two output files of note, birdseed/birdseed-v1.calls.txt and birdseed/birdseed-v1.snp-models.txt which will be your calls.fn and clusters.fn arguments, respectively. Lastly you will need to specify snp.file.parser=AptSnpFileParser and cluster.file.parser=BirdseedClustersFileParser. If you are using APT, it is likely that you will not have an external segmentation file for your samples. If this is the case make sure to use the value NULL for the seg.fn argument. ExampleUsing the example data in the release bundle we can run the data to recreate the mixing results as follows. This can be run on a multicore system by uncommenting the registerDoMC call and specifying the number of cores that you wish to use. Besides ABSOLUTE this code also requires the use of foreach and optionally doMC. This code will create a directory output which will contain a per-sample subdirectory containing all output as well as a subdirectory hapseg_logs which will contain per-sample text files containing the standard out being emitted from R. DoHapseg <- function(scan, sif) { registerDoSEQ() library(HAPSEG) plate.name <- "DRAWS" snp.file.base <- "./paper_example" snp.file.post <- "snp_byAllele.RData" seg.fn <- "./paper_example/mix250K_seg_out.txt" clusters.fn <- "./paper_example/birdclusters.RData" results.dir.base <- "./output" phased.bgl.dir <- "./paper_example/phasedBGL/hg18" genome <- "hg18" platform <- "SNP_250K_STY" pop <- "CEPH" impute.gt <- FALSE plot.segfit <- TRUE merge.small <- TRUE merge.close <- TRUE min.seg.size <- 5 normal <- FALSE out.p <- 0.001 seg.merge.thresh <- 1e-10 use.normal <- TRUE adj.atten <- FALSE snp.fn <- file.path(snp.file.base, paste(scan, snp.file.post, sep=".")) sif.info <- as.character(sif[scan, ]) results.dir <- file.path(results.dir.base, scan, "hapseg") if (!file.exists(results.dir)) { dir.create(results.dir) } print(paste("Starting scan", scan, "at", results.dir)) log.dir <- file.path(results.dir.base, "hapseg_logs") if (!file.exists(log.dir)) { dir.create(log.dir) } sink(file=file.path(log.dir, paste(scan, ".hapseg.out.txt", sep=""))) RunHapSeg(plate.name, scan, seg.fn, snp.fn, genome, results.dir, platform, pop,impute.gt, plot.segfit, merge.small, merge.close, min.seg.size, normal, out.p, seg.merge.thresh, use.normal, adj.atten, phased.bgl.dir, calibrate.data=TRUE, clusters.fn=clusters.fn, drop.x=TRUE, verbose=TRUE) sink() } arrays.txt <- "./paper_example/mix250K_arrays.txt" sif.txt <- "./paper_example/mix_250K_SIF.txt" ## read in array names scans <- readLines(arrays.txt)[-1] sif <- read.delim(sif.txt, as.is=TRUE) library(foreach) ##library(doMC) ##registerDoMC(20) foreach (scan=scans, .combine=c) %dopar% { DoHapseg(scan, sif) }
|