Installation manual

Installation of PathSeq

1. Essential tools and database:
RepeatMasker is an essential tool in the PathSeq pipeline: it removes low-complexity reads from the input sequencing dataset. RepeatMasker requires Cross match as its alignment tool and Repbase as its repeat database. Reliable results from the input dataset depend on this step.
Cross match and Repbase are not included in the PathSeq cloud package because of license requirements from the original authors. RepeatMasker itself, however, is already installed in the publicly available Amazon Machine Image.
For information on obtaining licenses for Cross match and Repbase, see the following:

    1. Cross match:
    Cross match is a general-purpose utility for comparing any two DNA sequence sets using a version of swat, a program for searching one or more DNA or protein query sequences, or a query profile, against a sequence database.
    Step 1: How to get a license for Cross match: http://www.phrap.org/consed/consed.html#howToGet
    Step 2: Types of licenses:
    For academic users:
    Fill in the license agreement and send it to Prof. Phil Green (phg@u.washington.edu).
    For commercial users:
    http://depts.washington.edu/uwc4c/express-licenses/assets/phred-phrap-consed-autofinish/

    2. Repbase:
    Repbase is a database of prototypic sequences representing repetitive DNA from different eukaryotic species. This database is required to run RepeatMasker.

Note:
1. Please make sure that the Repbase and Cross match archives are in .tar.gz and .tar.Z format, respectively (see the sample paths for REPBASELOC and CROSSMATCHLOC in cluster.config).
2. PathSeq can be used without Cross match and Repbase. However, the results are then not reliable, and PathSeq runs may take more CPU time.
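
The archive-format check in the note can be sketched in shell. This is only an illustration: has_suffix is a hypothetical helper that is not part of PathSeq, the two paths are the sample paths from the cluster.config comments, and the pairing of Repbase with .tar.gz and Cross match with .tar.Z is an assumption based on those samples.

```shell
#!/bin/sh
# Hypothetical helper (not part of PathSeq): succeeds when the file name
# in $1 ends with the suffix in $2. The comparison is case-sensitive.
has_suffix() {
    case "$1" in
        *"$2") return 0 ;;
        *)     return 1 ;;
    esac
}

# Sample paths taken from the cluster.config comments; replace with your own.
REPBASE_ARCHIVE="/local/directory/repeatmasker/repeatmaskerlibraries-20090604.tar.gz"
CROSSMATCH_ARCHIVE="/local/directory/repeatmasker/distrib.tar.Z"

# Assumption: Repbase is supplied as .tar.gz and Cross match as .tar.Z.
if has_suffix "$REPBASE_ARCHIVE" ".tar.gz"; then
    echo "Repbase archive name looks fine"
else
    echo "Repbase archive should be a .tar.gz file"
fi

if has_suffix "$CROSSMATCH_ARCHIVE" ".tar.Z"; then
    echo "Cross match archive name looks fine"
else
    echo "Cross match archive should be a .tar.Z file"
fi
```

The check looks only at the file name, not at the archive contents, so it catches a wrongly packaged download but not a corrupted one.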

2. PathSeq package:
1. Create a PathSeq folder in your local directory:
   mkdir PathSeq
2. Download the PathSeq package (pathseq_cloud.zip) into the PathSeq folder.
3. Extract the PathSeq package using the following command:
   unzip pathseq_cloud.zip
4. Configure cluster.config in $LOCAL_DIRECTORY/PathSeq/ with the following parameters (noted in Step 3 of the installation manual):

    # Environment variables for running PathSeq on Amazon EC2. All are required.
    ####################################################################
    #Information on Amazon Account Number
    ####################################################################
    # Your Amazon Account Number.
    AWS_ACCOUNTID=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    # Your Amazon AWS access key.
    AWS_ACCESS_KEYID=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    #Your Amazon AWS secret access key.
    AWS_SECRET_ACCESSKEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    # The EC2 key name used to launch instances.(Don't Edit)
    KEYNAME=pathseq-keypair

    # The Amazon S3 bucket where the Hadoop AMI is stored. (Edit)
    # Set this to a bucket you own so that the image can be stored there.
    IMAGEBUCKET=ami-pathseqimageXXXXXXXXXXXXXX

    # The EC2 instance type: m1.small, m1.large, m1.xlarge (Don't Edit)
    INSTANCETYPE=m1.large

    #######################################################################
    #Java, EC2 keypair, S3Cmd, Hadoop path location
    #######################################################################
    # Java path (sample: /local/directory/jdk1.6.0_20/bin)
    JAVAPATH=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    # Location of EC2 keys (*.pem and id-$KEYNAME) (sample: /local/directory/ec2).
    # Please make sure that the public key, private key and keypair file are in the ec2 folder.
    EC2KEYDIR=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    # S3CMD Location (sample:/local/directory/s3cmd-0.9.9.91)
    S3CMDLOC=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    # Hadoop-ec2 path (Don't Edit)
    HADOOPPATH=./bin

    #######################################################################
    #REPEAT MASKER LOCATION (INFORMATION ESSENTIAL FOR AMI CREATION)
    #######################################################################
    # Repeat masker available (Y/N)
    REPAVAIL=xxxxxx

    # Repbase location (sample: /local/directory/repeatmasker/repeatmaskerlibraries-20090604.tar.gz)
    REPBASELOC=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    # Cross match location (sample: /local/directory/repeatmasker/distrib.tar.Z)
    CROSSMATCHLOC=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    #####################################################################
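
Before launching anything, it may help to confirm that no placeholder values are left in cluster.config. The following is a minimal sketch, not part of the PathSeq package: check_config is a hypothetical function, the list of keys it checks is an assumption drawn from the listing above, and a value counts as a placeholder when it is empty or consists only of the letter x. The account ID in the demonstration file is made up.

```shell
#!/bin/sh
# Hypothetical check (not shipped with PathSeq): print every listed key
# that is missing from the config file, empty, or still an "xxxx" placeholder.
check_config() {
    config_file=$1
    for key in AWS_ACCOUNTID AWS_ACCESS_KEYID AWS_SECRET_ACCESSKEY \
               JAVAPATH EC2KEYDIR S3CMDLOC REPAVAIL; do
        # Value of the last KEY=value line, with trailing whitespace removed.
        value=$(sed -n "s/^[[:space:]]*$key=//p" "$config_file" \
                | sed 's/[[:space:]]*$//' | tail -n 1)
        case "$value" in
            "")       printf '%s\n' "$key" ;;  # key missing or empty
            *[!xX]*)  ;;                       # has a real character: accepted
            *)        printf '%s\n' "$key" ;;  # only x characters: placeholder
        esac
    done
}

# Demonstration on a throwaway file with made-up values.
demo_config=$(mktemp)
cat > "$demo_config" <<'EOF'
AWS_ACCOUNTID=123456789012
JAVAPATH=xxxxxxxxxxxx
EOF
echo "Keys that still need a value:"
check_config "$demo_config"
rm -f "$demo_config"

# With a real installation the call would be, for example:
#   check_config "$LOCAL_DIRECTORY/PathSeq/cluster.config"
```

IMAGEBUCKET is left out of the key list because its placeholder carries a fixed prefix; extend the list or the pattern if you want it covered too.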