Cookbook: Genotyping a novel site in 1000 Genomes Phase 1 using AWS

Step 1: Configure a Genome STRiP AWS cluster as described in Running Genome STRiP on the Amazon cloud.

Step 2: Create and mount an EBS drive to hold the 1000 Genomes pre-computed metadata.

If you plan to use an existing public data set hosted on AWS, such as 1000 Genomes Phase 1, create an EBS volume large enough to store the Genome STRiP metadata for that data set. For general instructions on managing EBS volumes through StarCluster, see http://star.mit.edu/cluster/docs/0.93.3/manual/volumes.html.

The required volume size depends on the data set you choose. For example, the metadata for 1000G Phase 1 is slightly less than 20 GB. The following command creates a 20 GB EBS volume to store the 1000G Phase 1 metadata:

starcluster createvolume --name=1kg_phase1_md --shutdown-volume-host 20 us-east-1

The command produces output that looks like this:

>>> Your new 20GB volume vol-aabbccdd has been created successfully
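If you script this step, the volume ID can be pulled out of the output instead of copied by hand. The following is a minimal sketch; extract_volume_id is a hypothetical helper name, and it assumes the output contains the ID in the vol-... form shown in the sample line above, which may vary between StarCluster versions.

```shell
# Hypothetical helper (not part of StarCluster): print the first EBS volume ID
# (vol- followed by hex digits) found in the text read from stdin.
extract_volume_id() {
    grep -oE 'vol-[0-9a-f]+' | head -n 1
}

# Example using the sample output line shown above.
echo '>>> Your new 20GB volume vol-aabbccdd has been created successfully' | extract_volume_id
# prints: vol-aabbccdd
```

The printed ID is the value to use for VOLUME_ID in the StarCluster config file.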

Add a section to the StarCluster config file describing the newly created volume:

[volume gs-metadata]
VOLUME_ID = vol-aabbccdd
MOUNT_PATH = /gs_metadata

Then add this volume to the section in the config file describing the Genome STRiP cluster:

[cluster gs-cluster]
KEYNAME = gs-keypair
PLUGINS = pkginstaller,gs-plugin
VOLUMES = gs-metadata

Save the changes to the StarCluster config file.
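The name in each VOLUMES entry has to match the name of a [volume ...] section, and a mismatch is easy to miss. The sketch below checks this before you launch the cluster. The function name check_volumes is hypothetical, and the parsing assumes one setting per line in the layout shown above; it is not a full parser for the StarCluster config format.

```shell
# Hypothetical helper: verify that every name listed on a VOLUMES line
# has a matching [volume NAME] section in the given config file.
check_volumes() {
    config="$1"
    status=0
    # Collect the comma-separated names from all VOLUMES lines.
    for name in $(sed -n 's/^[[:space:]]*VOLUMES[[:space:]]*=[[:space:]]*//p' "$config" | tr ',' ' '); do
        if grep -qE "^\[volume[[:space:]]+$name\]" "$config"; then
            echo "ok: $name"
        else
            echo "missing [volume $name] section"
            status=1
        fi
    done
    return $status
}

# Usage (assuming the default config location):
#   check_volumes ~/.starcluster/config
```

The function returns a non-zero status if any listed volume has no matching section, so it can be used as a guard in a launch script.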

Step 3: Launch a 1-node Amazon cluster.

Launch a cluster consisting of only the master node by running:

starcluster start gs-cluster -s 1

After this command completes, you can log in to the master node by running:

starcluster sshmaster gs-cluster

SVToolkit should be installed in /home/svtoolkit/, and the volume you created should be mounted on /gs_metadata.

Step 4: Copy the pre-computed Genome STRiP metadata to the EBS drive mounted on /gs_metadata.

Here is the table listing all the available SVToolkit metadata sets:

Metadata Set Name    URL Location
1000G Phase 1        ftp://ftp.broadinstitute.org/pub/svtoolkit/public_metadata/1000G_phase1_...

For example, to download the 1000G Phase 1 metadata, run:

wget ftp://ftp.broadinstitute.org/pub/svtoolkit/public_metadata/1000G_phase1_20101123_mdv1.tar.gz -P /mnt
tar -zxvf /mnt/1000G_phase1_20101123_mdv1.tar.gz -C /gs_metadata
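Because the archive is large, an interrupted download can leave a truncated file behind. The sketch below lists the archive before extracting it, so that an unreadable archive is reported without writing anything to the destination. The function name extract_checked is hypothetical, and the sketch assumes a tar that supports the -z option.

```shell
# Hypothetical helper: extract a .tar.gz archive only if it can be read in full.
extract_checked() {
    archive="$1"
    dest="$2"
    # Listing the archive reads it end to end, so a truncated or corrupt
    # download is expected to fail here, before anything is extracted.
    tar -tzf "$archive" > /dev/null || {
        echo "archive unreadable: $archive" >&2
        return 1
    }
    mkdir -p "$dest" && tar -zxf "$archive" -C "$dest"
}

# Usage on the master node:
#   extract_checked /mnt/1000G_phase1_20101123_mdv1.tar.gz /gs_metadata
```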

Tip: If you plan to run Genome STRiP using the same dataset in the near future, keep the EBS volume you created and populated with the metadata, so that the next time you launch gs-cluster the metadata volume will be automatically mounted on /gs_metadata and you won't have to re-download the metadata.
 

Step 5: Follow the instructions in Genotyping a novel site in 1000 Genomes Phase 1 (local).

When you set environment variables, set SV_METADATA_DIR to /gs_metadata/1000G_phase1_20101123_mdv1.
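One way to set the variable is sketched below. The helper name set_metadata_dir is hypothetical; it exports SV_METADATA_DIR only when the directory exists, so a volume that was not mounted or an incomplete extraction is caught before the genotyping run starts.

```shell
# Hypothetical helper: export SV_METADATA_DIR only if the directory exists.
set_metadata_dir() {
    if [ -d "$1" ]; then
        export SV_METADATA_DIR="$1"
    else
        echo "metadata directory not found: $1" >&2
        return 1
    fi
}

# Usage on the master node:
#   set_metadata_dir /gs_metadata/1000G_phase1_20101123_mdv1
```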