Cookbook: Genotyping A Novel Site In 1000 Genomes Phase 1

This recipe demonstrates how to assess a novel site (or a small number of novel sites) using the 1000 Genomes data. It is useful if you have a suspected large deletion variant and want to know whether there is any evidence for it in the 1000 Genomes populations and, if so, how it is distributed across those populations. The recipe uses a local installation of Genome STRiP but runs against a remote copy of the 1000 Genomes data.

Step 1: Download and install a local copy of Genome STRiP. See Installing Genome STRiP.

We will refer to the installation directory as SV_DIR. Code used in this recipe is in ${SV_DIR}/cookbook/genotyping/1000G_phase1_local.

Step 2: Download and install the 1000 Genomes pre-computed metadata.

This recipe accesses the 1000 Genomes BAM files remotely, but uses a local copy of pre-computed metadata for performance. The pre-computed metadata requires about 20 GB of disk space and includes the reference sequence and local copies of the BAM index files. It is available on the Genome STRiP FTP server. You can download and unpack it with the following commands.

cd /metadata/download/directory
wget ftp://ftp.broadinstitute.org/pub/svtoolkit/public_metadata/1000G_phase1_20101123_mdv1.tar.gz -P .
tar -zxvf 1000G_phase1_20101123_mdv1.tar.gz
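Because the metadata needs about 20 GB, you may want to confirm that the download directory has enough free space before starting. The following is a minimal sketch, assuming a POSIX-style df; the helper name has_free_gb is hypothetical and not part of Genome STRiP.

```shell
#!/bin/sh
# Hypothetical helper: succeed if DIR has at least MIN_GB gigabytes free.
# Usage: has_free_gb DIR MIN_GB
has_free_gb() {
    dir=$1
    min_gb=$2
    # df -P prints one line per file system in POSIX format;
    # -k reports sizes in 1024-byte blocks. Field 4 is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Example: warn if the current directory has less than 20 GB free.
if has_free_gb . 20; then
    echo "Enough free space for the metadata download."
else
    echo "Warning: less than 20 GB free in $(pwd)." >&2
fi
```

Run the check from the directory where you intend to unpack the archive, since that is the file system that needs the space.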

We will refer to the metadata directory as SV_METADATA_DIR.

Step 3: Prepare an input VCF file describing the sites you want to genotype.

An example VCF file is in example/1000G_MERGED_DEL_2_99615.vcf.

The POS field gives the start coordinate of each deletion and the END tag in the INFO field gives the end coordinate. The SVTYPE tag in the INFO field must be set to DEL for deletion genotyping.

You can prepare the input VCF file using a text editor, but the fields must be separated by tabs, not spaces. Check that your editor does not silently convert tabs to spaces.
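As an illustration of the layout described above, the sketch below writes a one-site VCF with printf, so that the tabs are explicit, and then checks that every record has at least eight tab-separated columns. The file name, site ID, chromosome, and coordinates are hypothetical placeholders, not the contents of the example file, and the header lines shown are an assumption based on the general VCF format. Genome STRiP may expect additional header lines or INFO tags, so compare against example/1000G_MERGED_DEL_2_99615.vcf before relying on it.

```shell
#!/bin/sh
# Hypothetical output file name.
vcf=my_sites.vcf

# Write a minimal header and one deletion record; \t is a literal tab.
# All values (ID, chromosome, POS, END) are placeholders.
{
    printf '##fileformat=VCFv4.1\n'
    printf '##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate of the variant">\n'
    printf '##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">\n'
    printf '##ALT=<ID=DEL,Description="Deletion">\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t1000000\tMY_DEL_1\tN\t<DEL>\t.\t.\tEND=1005000;SVTYPE=DEL\n'
} > "$vcf"

# Hypothetical check: report any non-header line with fewer than 8
# tab-separated fields, which usually means spaces were used instead of tabs.
check_tabs() {
    awk -F '\t' '
        !/^#/ && NF < 8 {
            printf "line %d: only %d tab-separated field(s)\n", NR, NF
            bad = 1
        }
        END { exit bad }' "$1"
}

check_tabs "$vcf" && echo "$vcf looks tab-delimited"
```

The same check_tabs function can be pointed at a file you edited by hand to catch spaces before you start a genotyping run.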

Step 4: Set environment variables.

The example scripts for this recipe set environment variables through the script scripts/set_sv_params.sh. You should edit this file to set up the following environment variables:

SV_DIR:  The installation directory for the SVToolkit code.

SV_TEMPDIR:  The directory to use for temporary files.

SV_METADATA_DIR:  The directory where the downloaded metadata is stored.
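A minimal sketch of the kind of settings this file needs is shown below. The paths are hypothetical placeholders, and the set_sv_params.sh shipped with the cookbook may define additional variables, so edit the existing file rather than replacing it with this fragment.

```shell
# Hypothetical paths; substitute the locations on your system.
export SV_DIR=/path/to/svtoolkit
export SV_TEMPDIR=/path/to/tmp
export SV_METADATA_DIR=/path/to/1000G_phase1_metadata

# Optional sanity check: report any of the three directories that does not exist.
for var in SV_DIR SV_TEMPDIR SV_METADATA_DIR; do
    eval "dir=\${$var}"
    [ -d "$dir" ] || echo "Warning: $var=$dir is not a directory" >&2
done
```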

Step 5: Genotype your sites.

You can genotype the site in the example input VCF file using the following commands:

cd ${SV_DIR}/cookbook/genotyping/1000G_phase1_local
scripts/genotype_sites.sh example/1000G_MERGED_DEL_2_99615.vcf rundir

The arguments to genotype_sites.sh are the input VCF file (which specifies the set of sites to genotype) and an output run directory (where the results and intermediate files are created). Use a distinct run directory for each set of input sites (each input VCF file) to avoid file name conflicts.
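If you have several input VCF files, one way to keep the run directories distinct is to derive each directory name from the input file name. The sketch below assumes it is run from ${SV_DIR}/cookbook/genotyping/1000G_phase1_local; the my_sites directory, the rundir_ prefix, and the rundir_for helper are hypothetical. The loop only prints each command so that you can review it first.

```shell
#!/bin/sh
# Hypothetical helper: derive a run directory name from a VCF path,
# e.g. my_sites/batch1.vcf -> rundir_batch1
rundir_for() {
    printf 'rundir_%s\n' "$(basename "$1" .vcf)"
}

# Print the genotyping command for each input file (my_sites/ is a placeholder).
# Remove the leading "echo" to actually run genotype_sites.sh.
for vcf in my_sites/*.vcf; do
    [ -e "$vcf" ] || continue   # skip if the glob matched nothing
    echo scripts/genotype_sites.sh "$vcf" "$(rundir_for "$vcf")"
done
```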

The genotype_sites.sh script creates the following primary output files:

rundir/1000G_MERGED_DEL_2_99615.genotypes.vcf:  The output VCF file with genotypes. See Interpreting VCF Output.

rundir/1000G_MERGED_DEL_2_99615.genotypes.pdf:  A set of "genotyping plots" showing information about the genotypes at each site. See Understanding Genotyping Plots.
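For a quick text summary of the calls at each site, you can tally the GT values in the output VCF. The sketch below assumes only the standard VCF layout (sample columns start at column 10, and GT is the first colon-separated subfield of each sample entry). The function name is hypothetical. It does not break the counts down by population, which would need a separate sample-to-population mapping.

```shell
#!/bin/sh
# Hypothetical helper: for each site, print the site ID followed by the
# number of samples with each GT value. The order of GT values is unspecified.
count_genotypes() {
    awk -F '\t' '
        /^#/ { next }                        # skip header lines
        {
            split("", counts)                # reset the per-site counts
            for (i = 10; i <= NF; i++) {
                split($i, sub_fields, ":")   # GT is the first subfield
                counts[sub_fields[1]]++
            }
            line = $3
            for (gt in counts) line = line "\t" gt "=" counts[gt]
            print line
        }' "$1"
}

# Example usage with the output file from this recipe, if it exists.
out=rundir/1000G_MERGED_DEL_2_99615.genotypes.vcf
if [ -f "$out" ]; then count_genotypes "$out"; fi
```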

You can compare the results of genotyping the example site to the files in the baseline sub-directory. If you are using a newer version of Genome STRiP, there may be small differences between the values in the output VCF file and the version in the baseline directory.
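One way to make this comparison is to diff only the column header and record lines, since the "##" meta-information lines often differ between runs and versions. This is a sketch: the helper name is hypothetical, and it assumes the baseline file has the same name as the output file, which you should confirm by listing the baseline directory.

```shell
#!/bin/sh
# Hypothetical helper: diff two VCF files, ignoring "##" meta-information lines.
# Returns the exit status of diff (0 if the remaining lines are identical).
diff_vcf_records() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
    sed '/^##/d' "$1" > "$tmp_a"
    sed '/^##/d' "$2" > "$tmp_b"
    status=0
    diff "$tmp_a" "$tmp_b" || status=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}

# Assumed file names; adjust to match the contents of the baseline directory.
new=rundir/1000G_MERGED_DEL_2_99615.genotypes.vcf
base=baseline/1000G_MERGED_DEL_2_99615.genotypes.vcf
if [ -f "$new" ] && [ -f "$base" ]; then
    diff_vcf_records "$new" "$base" && echo "No differences in VCF records."
fi
```

Any lines that diff prints are records whose values differ from the baseline, which is expected to some degree with newer versions of Genome STRiP.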