Cookbook: Running Genome STRiP on the Amazon cloud

This document explains one method to install and configure Genome STRiP to run on the Amazon cloud.

Step 1: Sign up for an Amazon EC2 account (see http://aws.amazon.com/ec2/)

Step 2: Install the StarCluster Amazon cluster management software on your local network (see http://star.mit.edu/cluster/)

Step 3: Create a StarCluster config file

Generate a new Amazon EC2 keypair. For example, to generate a keypair called gs-keypair, run

starcluster createkey gs-keypair -o ~/.ssh/rsa-gs-keypair

Create a StarCluster directory (by default, ~/.starcluster) and download the Genome STRiP sample StarCluster config file by running

wget ftp://ftp.broadinstitute.org/svtoolkit/aws/config -P ~/.starcluster

Then edit the required information as described in Step 4.

Alternatively, create a default StarCluster config file from scratch (see http://star.mit.edu/cluster/docs/latest/manual/configuration.html). By default, the file is created as ~/.starcluster/config.

The Genome STRiP plugin for StarCluster dynamically installs a specific version of Genome STRiP when a cluster is launched. Download and install the plugin by running

wget ftp://ftp.broadinstitute.org/pub/svtoolkit/aws/SVToolkitInstaller.py -P ~/.starcluster/plugins/
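
Before moving on, it can help to confirm that both downloads landed where StarCluster expects them. The following is a minimal sketch, assuming the default ~/.starcluster directory; the function name check_starcluster_files is hypothetical and is not part of StarCluster or Genome STRiP.

```shell
# Hypothetical helper: report which Step 3 files are present.
# Assumes the default layout: <dir>/config and <dir>/plugins/SVToolkitInstaller.py
check_starcluster_files() {
    dir="${1:-$HOME/.starcluster}"
    status=0
    for f in "$dir/config" "$dir/plugins/SVToolkitInstaller.py"; do
        if [ -f "$f" ]; then
            echo "found: $f"
        else
            echo "missing: $f"
            status=1
        fi
    done
    # Non-zero exit status if any file is missing
    return $status
}
```

Calling check_starcluster_files with no argument checks ~/.starcluster; pass a different directory as the first argument if your config lives elsewhere.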

Step 4: Configure StarCluster

Modify the [aws info] section of the config file to fill in the AWS credentials information for your EC2 account.

Add the keypair section for the EC2 keypair you have created:

[key gs-keypair]
KEY_LOCATION=~/.ssh/rsa-gs-keypair

Add a section describing the SVToolkit plugin:

[plugin gs-plugin]
SETUP_CLASS = SVToolkitInstaller.SetupClass
email = your@email
SVVersion=

Set email to the e-mail address that you provided when you registered on the Genome STRiP web site.
If you don't specify SVVersion, the plugin will download and install the latest available SVToolkit version.

Add a section describing the Genome STRiP cluster:

[cluster gs-cluster]
KEYNAME = gs-keypair
PLUGINS = pkginstaller,gs-plugin

Save the changes to the StarCluster config file.
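
A quick way to catch a forgotten section before launching a cluster is to search the config file for the four section headers used in this step. This is a rough sketch, not a StarCluster feature: the function name missing_sections is hypothetical, and the check only matches header lines exactly as written above (a header with trailing whitespace would be reported as missing).

```shell
# Hypothetical helper: list the Step 4 section headers missing from a config file.
missing_sections() {
    config="${1:-$HOME/.starcluster/config}"
    for section in "[aws info]" "[key gs-keypair]" "[plugin gs-plugin]" "[cluster gs-cluster]"; do
        # -x: match the whole line; -F: treat the header as a literal string, not a regex
        grep -qxF -- "$section" "$config" || echo "$section"
    done
}
```

If missing_sections prints nothing, all four headers are present. It does not validate the settings inside each section.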

Step 5: Launch a 1-node Amazon test cluster

Launch a cluster consisting of only the master node by running:

starcluster start gs-cluster -s 1

After this command completes, you can log in to the master node by running:

starcluster sshmaster gs-cluster

The SVToolkit version you selected should be installed in /home/svtoolkit/.

Step 6: Check the Genome STRiP installation

You can verify the Genome STRiP installation using

java -jar ${SV_DIR}/lib/SVToolkit.jar

which will print version information.
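
If the java command fails, the cause is often that SV_DIR is unset or does not point at the installed toolkit. The sketch below checks both conditions before java is run. The function name sv_jar_path is hypothetical; the jar location is the one used in the command above.

```shell
# Hypothetical helper: print the SVToolkit jar path, or fail with a message on stderr.
sv_jar_path() {
    if [ -z "${SV_DIR:-}" ]; then
        echo "SV_DIR is not set" >&2
        return 1
    fi
    jar="$SV_DIR/lib/SVToolkit.jar"
    if [ ! -f "$jar" ]; then
        echo "not found: $jar" >&2
        return 1
    fi
    echo "$jar"
}

# Usage on the master node (java runs only if both checks pass):
# jar=$(sv_jar_path) && java -jar "$jar"
```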

You can also run the installtest on the cluster to further validate the installation.

See SVToolkit Recipes for more details on how to run various pipelines.

Step 7: Working with the cluster

Before running large analyses, you might want to log out from the master node and add more nodes to the cluster. For example, running

starcluster addnode -n 3 gs-cluster

will add 3 more nodes to gs-cluster. These additional instances are automatically added to the GridEngine host list, so the next time you log in to the master node and run a pipeline that submits jobs to GridEngine, all the instances in the cluster will be available to accept these jobs.

After you are done using the cluster, you should terminate it by running

starcluster terminate gs-cluster
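
To review a whole session before spending money on instances, the commands from Steps 5 and 7 can be printed in order without running them. This is only a dry-run sketch: the function name plan_session is hypothetical, nothing is executed, and the node count is passed with the -n option, which is assumed here to be how StarCluster's addnode command takes a number of nodes.

```shell
# Hypothetical helper: print the StarCluster commands for one session.
# Nothing is executed; the commands are only echoed for review.
plan_session() {
    cluster="$1"
    extra_nodes="$2"
    echo "starcluster start $cluster -s 1"
    if [ "$extra_nodes" -gt 0 ]; then
        # Assumption: -n sets the number of nodes to add
        echo "starcluster addnode -n $extra_nodes $cluster"
    fi
    echo "starcluster sshmaster $cluster"
    echo "starcluster terminate $cluster"
}

plan_session gs-cluster 3
```

Passing 0 as the second argument omits the addnode line, which corresponds to the 1-node test cluster from Step 5.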