Genome STRiP example workflow for C4 A/B analysis in Terra

Genome STRiP example workflow for C4 A/B analysis in Terra

We have created a public workspace in Terra to illustrate how to run a copy number analysis of C4 using Genome STRiP. The demonstration workspace contains a WDL workflow that can be run on the 1000 Genomes high coverage data. You can clone the demonstration workspace to run the analysis yourself on the 1000 Genomes data or export the workflow to your own Terra workspace to run the same analysis on your data.The Terra workspace is available here (login required):


The complement component 4 (C4) gene is part of the complement pathway in the human innate immune system. Variation in C4 has been associated with risk for schizophrenia¹, systemic lupus erythematosus (SLE or “lupus”)² and Sjögren’s syndrome².

In humans, the C4 gene is copy-number variable and exists in a variety of allelic forms. Humans typically have between 2 and 8 total copies of the C4 gene (as opposed to two copies for most genes). Each of these copies may be one of the two forms called C4A or C4B which are distinguished by a sequence of 6 amino acids at an active binding site. In addition, each copy of C4 may be either “long” or “short” as determine by the presence or absence of a endogenous human retroviral DNA sequence (HERV) that is present within intron 9 of some copies of the gene. These different numbers and forms of C4 exist on a large variety of different structural haplotypes at the C4 locus on chromosome 6 in the HLA region¹.

Because of the complexity of this region and the high degree of variability in humans, standard automated analysis pipelines generally do not provide full, accurate characterization of the C4 genotypes in each individual. This workflow contains a custom analysis pipeline, built on top of the Genome STRiP software³, to analyze C4 from whole-genome sequencing data. This analysis pipeline is the basis for our continuing research in the McCarroll Lab at Harvard Medical School on C4 and its association with schizophrenia and other diseases.

This demonstration workspace is set up to be able to run the C4 analysis workflow on the 1000 Genomes⁴ high-coverage WGS data, available from the NHGRI AnVIL⁵. The workflow is designed to run in batches of approximately 100 samples each. For this demonstration workspace, we have set up 26 batches (sample sets) corresponding to each of the 1000 Genomes populations. Running the workflow on one population takes less than an hour and typically costs one or two cents.

Workspace Data

The data used in this demonstration workspace is high-coverage whole-genome sequencing from the 1000 Genomes project. The underlying cram files are stored in the AnVIL.

Running the example workflow for C4 A/B analysis in Terra

The gs_analyze_C4AB_batch workflow operates on a sample set. To run the workflow, select one of the sample sets corresponding to a 1000 Genomes population from the Terra user interface on which to run. These sample sets are named md_batchN. The workflow is designed to run on sample sets of approximately 100 to 200 samples. The workflow will not run on the sample set that represents the entire cohort (“1000G-high-coverage-2019-all”) without modifications to increase the resources used by the workflow.

Typical runtime is around 45 minutes and the typical cost is one or two cents.

The workflow will analyze the copy-number variable segments at the C4 locus, including the copy number of the C4 genes and the HERV segment. The workflow will also analyze the reads mapping to the C4A/C4B active site and estimate the number of copies of C4A and C4B in each sample.

In this demonstration workspace, we use precomputed Genome STRiP metadata for the 1000 Genomes project (which we host in a bucket on Google cloud storage). To run this workflow on your own data set, you will first need to run standard Genome STRiP preprocessing and group your samples into analysis batches of approximately 100 samples each. See the Genome STRiP documentation for more details.

Workflow Output

The workflow produces a .zip file with multiple output files. The output files produced are:


This is the primary output file. It is a tab delimited summary file with the following columns:

Column Description
SAMPLE Sample identifier (from the input cram file)
C4 Total C4 copy number (integer value)
CNF_C4 Internal fractional estimate of C4 copy number
CNQ_C4 Phred-scaled quality value for the C4 copy number estimate
HERV Total HERV copy number (integer value)
CNF_HERV Internal fractional estimate of HERV copy number
CNQ_HERV Phred-scaled quality value for the HERV copy number estimate
C4A Estimate of total number of C4A genes
C4B Estimate of total number of C4B genes
C4A1 Estimate of total number of non-classical C4AB gene of type C4A1
C4A2 Estimate of total number of non-classical C4AB gene of type C4A2
C4B1 Estimate of total number of non-classical C4AB gene of type C4B1
C4R1 Estimate of total number of non-classical C4AB gene of type C4R1
AB_QUAL Phred-scaled quality value for the C4A/C4B copy number estimates
COUNTS The number of reads supporting each C4A/C4B estimate (comma separated).


This is a VCF file containing the copy number estimates for the C4 gene and the HERV segments, the VCF is generated using Genome STRiP genotyping of two CNV segments: CNV_C4_SEGABC_M2 gives the total C4 copy number (without the HERV) and CNV_HERV_M2 gives the total HERV copy number.


This PDF file contains plots of the copy number distributions of the C4 and HERV segments, based on the information in the vcf file.


This tab-delimited file contains a list of the kmers observed at the C4A/C4B active site. If the kmer is present in the known_kmers file, it will be identified based on the information in that file. Novel kmers will also be included in this file, although many kmers may also be listed that are caused by occasional sequence errors, not true genetic variation. See the section on novel C4 A/B alleles for more details.


This tab-delimited file describes the known / expected sequence variation at the C4A/C4B active site. If you encounter a novel C4 A/B allele in your data set, you can provide an updated known_kmers file to analyze the unique allele. See the section on novel C4 A/B alleles for more details.


This tab-delimited file lists all of the kmers observed at the C4A/C4B active site and includes information about the sequencing read and sample where each kmer was observed. This file is mainly useful for detailed assessment or validation of novel sequence variation at the active site.

Classical and Non-classical C4 A/B Alleles

There are two main forms of the C4 gene, C4A and C4B, which are distinguished by 4 amino acid substitutions are a 4 amino acid active site on exon 26. As we study the C4 gene in larger and more diverse populations, however, we observe other (rare) DNA variation at the active site that creates copies of the C4 gene that are neither the classical A or B alleles.

To date, we have characterized four novel alleles of C4 that are segregating in the human population. Three of these alleles are present in the 1000 Genomes samples. In the 1000 Genomes cohort, each individual that carries a non-classical C4 allele carries that allele in only one of their copies of C4 and the individual has other copies of C4 with the classical A/B alleles as well.

The DNA sequences for the (known) non-classical alleles are given in the file C4_hg38.known_kmers.txt (see the workflow output section). The alleles named C4A1 and C4A2 are closer to the classical C4A allele (one amino acid change) while the allele named C4B1 is closer to the classical C4B allele. The C4R1 allele is (at the sequence level) a combination of C4A and C4B.

Novel C4 A/B Alleles

This workflow does automated analysis of the known C4 A/B alleles listed in the known_alleles input file. In addition, the workflow characterizes all observed kmers at the active site to facilitate the identification of novel alleles.

The output file C4_hg38.kmer_analysis_table.txt.gz lists all kmers observed at the active site. If the kmer is present in the known kmers file, it will be labeled with the identifier from the known kmers file. Other kmers are labeled as NA. Most of these unknown kmers will be due to occasional sequencing errors and do not represent true genetic variation.

The workflow runs an analysis of all observed kmers to determine whether they are likely to result from true genetic variation (as opposed to sequencing error). True genetic variation is modeled as a non-random occurrence of a kmer in a subset of samples compared to a null model where observations of a kmer are from a Poisson-distributed model with a sequencing error rate estimated from the observed data. The results of this analysis are listed in the kmer_analysis_table file, which has the following columns:

Column Description
KMER The DNA sequence at the active site
LABEL The label from the known_kmers file, or NA if this is a novel kmer
COUNT The number of reads where this kmers was observed (across all samples)
ERATE The estimated sequencing error rate at this base position
MAXLOD The LOD score from the sample with the highest LOD score for this kmer
DIST The distribution of COUNT for the top 10 samples with the highest counts

The table is sorted in decreasing order of MAXLOD. The MAXLOD score is the estimate of whether this kmer represents a true novel allele. In most cases, the classical C4A and C4B alleles will appear at the top of the table as confidently assigned true alleles. A row with a negative MAXLOD (or a small positive MAXLOD) is most likely due to sequencing error. A true allele would be observed multiple times (depending on sequencing depth) in a small number of samples. If you discover a novel C4 allele, please contact us so that we may add it to the known allele table.

If you discover a novel allele in your cohort, you can add this allele to the known_alleles file (it is an optional workflow input) and rerun the analysis. This will add the novel allele to the C4_table and generate genotypes based on the novel allele. The genotype calculations in the C4_table assume a fixed set of true alleles, defined by the known_alleles file.


  1. Sekar, A. et al. Schizophrenia risk from complex variation of complement component 4. Nature 530, 177–183 (2016)

  2. Kamitaki, N. et al. Complement component 4 genes contribute sex-specific vulnerability in diverse illnesses. Biorxiv (2019)

  3. Handsaker, R. E. et al. Large multiallelic copy number variations in humans. Nat Genet 47, 296–303, doi:10.1038/ng.3200 (2015)

  4. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68-74, doi:10.1038/nature15393 (2015)

  5. National Human Genome Research Institute,