Converting Birdseed to VCF

ContEst, the contamination estimation tool, requires a VCF as input if you want to provide known genotypes from an array.  To convert Birdseed output (the most commonly used tool for processing raw array CEL files into genotypes) into a VCF we've provided a simple python script.  


The tool requires one non-standard Python module: pyfasta.  You should install this before proceeding, and check that the module is visible on your python classpath.

Other files that are need:

  • Genome annotation file for your array type.  Gzipped, Python-pickled version for Affy SNP6 are available here for hg18 and hg19.
  • A fasta file for the reference that your arrays were run on (be careful to match the two).   Pyfasta will index this, which takes a few minutes, the first time the tool is run.
  • The tool itself, a pretty simple python script, is available here.

Running the Tool

The parameters for the tool can be found by typing the following python -h, which displays the help information:

usage: [-h] --birdseed BIRDSEED --output_vcf OUTPUT_VCF
                        --array_sample ARRAY_SAMPLE --vcf_sample VCF_SAMPLE
                        --snp_annotation_file SNP_ANNOTATION_FILE --fasta
                        FASTA [--add_chr] [--one_based_annotations]
                        [--dont_convert_MT_to_M] [--show_dropped_sites]
                        [--contig_ordering CONTIG_ORDERING]
Convert a birdseed call file to a VCF for contamination input
optional arguments:
  -h, --help            show this help message and exit
  --birdseed BIRDSEED   the birdseed file, containing SNP positions and their
                        genotype call column
  --output_vcf OUTPUT_VCF
                        the output vcf file
  --array_sample ARRAY_SAMPLE
                        the sample name to look for in the array (probe) file,
                        this is converted to the output sample name for the
  --vcf_sample VCF_SAMPLE
                        the sample name used for the VCF file; this should be
                        the same sample used in the BAM file in the
                        contamination estimation
  --snp_annotation_file SNP_ANNOTATION_FILE
                        the annotation file, describing the SNP positions and
                        their alleles from the array manufacturer
  --fasta FASTA         the reference fasta file
  --add_chr             sometime the reference sequence uses chr, where the
                        arrays dont, in this case you want to add a 'chr' to
                        the array contig name
                        annotation files are most likely zero based, if this
                        flag is set we assume its one based
                        convert the mitochondrial name of MT to M
  --show_dropped_sites  if we drop a site due to missing information, output a
  --sort_on_the_fly     sometimes probes can come out of order, depending on
                        the reference. If this flag is set we store every
                        record in memory, and sort when outputting (large
                        memory req!)
  --contig_ordering CONTIG_ORDERING
                        if the contigs have a canonical ordering, we'll order
                        the contigs by the ordering in this file. Each contig
                        should be on a seperate line, in the order you'd like


An example command line would look like the following:

python \
--birdseed birdseed.calls.txt \
--output_vcf GBM-Native-02-0047-Normal.vcf \
--snp_annotation_file GenomeWideSNP_6.na30.annot.hg18.csv.pickle \
--array_sample TRIBE_p_TCGAaffx_B1_2_GBM_Nsp_GenomeWideSNP_6_D08_155834 \
--vcf_sample GBM-Native-02-0047-Normal \
--fasta ~/fasta/hg18.fasta \

In the above example, we're taking the birdseed.calls.txt as input, and outputting a file called GBM-Native-02-0047-Normal.vcf.  Since this an hg18 sample (both the birdseed and the pickle file are hg18), we add the --add_chr command, since our birdseed output used simple numbered chromosomes (e.g. 1 instead of chr1), whereas our reference uses the chr1 convention.  Since our samples are named differently in the birdseed file, we use the --array_sample parameter to specify the name in the birdseed file, as well as the --vcf_sample to indicate what we'd like to be in the output VCF.  This is important, as your sample name here must match what's specified in the BAM file for ContEst to run correctly (there are ways around this, but why not get it right now).

Sample output from a run should look like the following:

python --birdseed example_calls_small.txt --output_vcf my_sample.vcf --snp_annotation_file GenomeWideSNP_6.na30.annot.hg18.csv.pickle --array_sample FADGE_p_TCGA_b100_SNP_N_GenomeWideSNP_6_C08_747496 --vcf_sample MY_SAMPLE --fasta ~/fasta/hg18.fasta

[Status] loading the FASTA file...
[Status] loading the annotation file...
[Status] loading from pickled annotation file...


Contig (chromosome) ordering

You may run into trouble when the ordering of the contigs in your birdseed file are different than your genome build Fasta file; the best way to handle this is to use the --contig_ordering to provide a reordering of the output contigs to match that of the  Fasta file.  This file is simply a list (one chromosome name per line), indicating the order of contigs you'd like to have in the output VCF, regardless of what was seen in the input birdseed file.  Contigs not listed in this file will be omitted from the output, even if they're in the input birdseed file.