VCF

VCF stands for Variant Call Format, and it is used by the 1000 Genomes project to encode structural genetic variants. See Viewing Variants for example IGV visualizations of mutation and VCF files.

  • Variant calls include SNPs, indels, and genomic rearrangements.
  • Samples may also be annotated with attribute information, including pedigree and family information. IGV uses these annotations to group, sort, and filter samples, e.g. to group samples by population.

A consistent color scheme is used in the variant display row (the top row) for files with or without genotypes.

  • blue - minor allele frequency/fraction is known from annotation or genotype data
  • grey - minor allele frequency is not known
  • red - height is proportional to minor allele frequency
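
When a file carries genotypes but no frequency annotation, an allele fraction can be derived by counting called alleles. The sketch below illustrates that arithmetic on genotype strings in VCF GT notation (allele indices separated by | or /). The function name allele_fractions is hypothetical, and this is not necessarily how IGV computes the value it displays.

```python
import re
from collections import Counter


def allele_fractions(genotypes):
    """Return {allele_index: fraction} over all called alleles.

    Each genotype is a GT string such as "0|1" or "1/1"; index 0 is the
    reference allele and "." marks a missing call, which is skipped.
    """
    counts = Counter()
    for gt in genotypes:
        for allele in re.split(r"[|/]", gt):
            if allele != ".":
                counts[int(allele)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in sorted(counts.items())}


# Three diploid samples: six called alleles, three of them alternate.
fractions = allele_fractions(["0|0", "1|0", "1/1"])
print(fractions[1])
```

For these three genotypes, three of the six called alleles are the alternate allele, so the alternate allele fraction is 0.5.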

Required Extensions: .vcf, .vcf.gz

If the file is gzipped (ends with .vcf.gz), it must have an accompanying Tabix index (see below).

VCF Requirements

IGV supports VCF Version 4.

VCF data files must be indexed for viewing in IGV, either by using igvtools or by using Tabix. 

  • igvtools can be run from the command line or from within IGV (Tools>Run igvtools...). After launching, choose the Index command and browse to your .vcf file. The index file (.idx) is created in the same directory as the .vcf file.
    • igvtools also sorts .vcf files.
  • Tabix creates a .tbi file. Tabix, including documentation, is available from the SAMtools Web site.
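
Before loading a file, it can be useful to confirm that an index sits next to it. The following is a minimal sketch, assuming the usual naming convention in which the index extension is appended to the full file name (calls.vcf.idx from igvtools, calls.vcf.gz.tbi from Tabix). The function name find_vcf_index is hypothetical and is not part of IGV or igvtools.

```python
from pathlib import Path
from typing import Optional


def find_vcf_index(vcf_path: str) -> Optional[Path]:
    """Return the index file next to vcf_path, or None if it is missing."""
    vcf = Path(vcf_path)
    if vcf.name.endswith(".vcf.gz"):
        index_name = vcf.name + ".tbi"   # gzipped files need a Tabix index
    elif vcf.name.endswith(".vcf"):
        index_name = vcf.name + ".idx"   # assumed igvtools index name
    else:
        raise ValueError(f"not a .vcf or .vcf.gz file: {vcf_path}")
    index = vcf.with_name(index_name)
    return index if index.is_file() else None
```

A return value of None means the file still needs to be indexed with igvtools or Tabix before viewing.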

VCF Specification

Example VCF 4.0 file:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

This example shows in order:

  • A good, simple SNP
  • A possible SNP that has been filtered out because its quality is below 10
  • A site at which two alternate alleles are called, with one of them (T) being ancestral (possibly a reference sequencing error)
  • A site that is called monomorphic reference (i.e., with no alternate alleles)
  • A microsatellite with two alternative alleles, one a deletion of 3 bases (TCT), and the other an insertion of one base (A).
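
The kinds of variant listed above can be told apart by comparing the REF and ALT strings. The sketch below is a simplified illustration, not IGV's logic: the function name classify_allele and its labels are hypothetical, and symbolic alleles such as <DEL> are not handled.

```python
def classify_allele(ref: str, alt: str):
    """Classify one ALT allele against REF by comparing string lengths.

    Returns (kind, size), where size is the number of bases inserted or
    deleted (0 for same-length changes and for sites with no ALT).
    """
    if alt == ".":
        return ("reference", 0)          # monomorphic reference site
    change = len(alt) - len(ref)
    if change > 0:
        return ("insertion", change)
    if change < 0:
        return ("deletion", -change)
    return ("SNP", 0) if len(ref) == 1 else ("MNP", 0)


# The microsatellite site from the example: REF=GTCT, ALT=G,GTACT
for alt in "G,GTACT".split(","):
    print(alt, classify_allele("GTCT", alt))
```

Applied to the microsatellite record, this reports a deletion of 3 bases for G and an insertion of 1 base for GTACT, matching the description above.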

Genotype data are given for three samples, two of which are phased and the third unphased. Each sample also carries per-sample genotype quality, depth, and haplotype qualities (the latter only for the phased samples). The microsatellite calls are unphased.
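
To make the per-sample layout concrete, the sketch below splits one data line from the example into its fixed columns and per-sample values, and reads phasing from the genotype separator (| for phased, / for unphased). It is an illustration only: the function name parse_vcf_record is hypothetical, and the code splits on any whitespace so that it works on the listing as printed here, whereas real VCF files are tab-delimited.

```python
def parse_vcf_record(line, sample_names):
    """Split one VCF data line into fixed fields and per-sample values."""
    fields = line.split()
    chrom, pos, var_id, ref, alt, qual, filt, info, fmt = fields[:9]
    keys = fmt.split(":")                      # e.g. GT, GQ, DP, HQ
    samples = {}
    for name, column in zip(sample_names, fields[9:]):
        # Trailing values may be omitted, so zip stops at the shorter list.
        values = dict(zip(keys, column.split(":")))
        values["phased"] = "|" in values.get("GT", "")
        samples[name] = values
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "filter": filt,
        "samples": samples,
    }


record = parse_vcf_record(
    "20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ "
    "0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3",
    ["NA00001", "NA00002", "NA00003"],
)
for name, values in record["samples"].items():
    print(name, values["GT"], "phased" if values["phased"] else "unphased")
```

For this record the first two samples come out as phased and the third as unphased, and the third sample has no HQ entry because its haplotype qualities were omitted.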