Reference Genome Metadata

Reference Genome Metadata

Genome STRiP uses sequeunce reads that have been aligned to a reference genome. During processing, Genome STRiP requires the reference genome (in indexed fasta format) and in addition utilizes several forms of pre-computed information based on the reference genome (independent of the data set being analyzed).

Starting with Genome STRiP 2.0, we have reorganized our use of this "reference metadata". The reference metadata is now packaged into a single directory, the reference metadata directory, which contains the reference genome and all of the reference-associated files needed by Genome STRiP. It is recommended that you use one of the pre-computed reference metadata bundles that are available as tarballs from our web site ftp://ftp.broadinstitute.org/pub/svtoolkit/reference_metadata_bundles.

  • HG38 (GRCh38)
  • Also known as Homo_sapiens_assembly38, this is the human GRCh38 reference genome. For the new genome build, there has been standardization across the major sequencing centers worldwide on this reference sequence. In addition, the major sequencing centers have standardized on a set of "functionally equivalent" pipelines for sequence alignment, making it easier to analyze data produced at different sequencing centers together.

  • HG19 (Broad version)
  • Also known as Homo_sapiens_assembly19, this is the HG19 reference genome used by default at the Broad Institute. If your sequencing data was generated at Broad, then this is most likely the reference genome your data was aligned to. This genome corresponds closely to the GRC version of HG19, but also includes the Epstein-Barr virus (contig NC_007605).

  • 1000 Genomes Phase 3
  • Also known as human_g1k_hs37d5, this is the HG19-based reference genome used by the 1000 Genomes Project for Phase 3 analysis (the final phase of the project). This genome differs from the Broad HG19 genome principally due to the inclusion of 35 Mb of human sequence (the so-called "decoy sequence") that is included as an extra contig named hs37d5. This reference genome also contains the Epstein-Barr virum (contig NC_007605).

    The reference metadata bundle for Genome STRiP includes a default genome alignability mask that uses a k-mer size of 70bp, corresponding to the shortest reads used in Phase 3 of the 1000 Genomes project.

  • 1000 Genomes Phase 1
  • Also known as human_g1k_v37, this is the HG19-based reference genome used by Phase 1 of the 1000 Genomes Project (PMID: 23128226). This genome is very similar to Broad HG19, but without the Epstein-Barr virus.

    The reference metadata bundle for Genome STRiP includes a default genome alignability mask that uses a k-mer size of 36bp, owing to the presence of some older sequencing data in 1000 Genomes Phase 1 that used 36bp reads.

Contents of the reference metadata bundles

Each reference metadata contains a copy of the reference genome sequence in indexed fasta format (a .fasta file and a .fasta.fai index file). In addition, the following files are also included in each reference metadata bundle:

  • reference.fasta
  • The reference genome sequence (and accompanying index file in reference.fasta.fai).

  • reference.ploidymap.txt
  • This file specifies the gender-dependent normal ploidy of each segment of the reference genome and is used in structural variation calling on the sex chromosomes. In humans, this file flags the Y chromosome as being present only in males and indicates that the X chromosome differs in normal copy number based on the sex of the individual (except for the pseudo-autosomal regions). The structure of this file is current human-centric.

  • reference.gendermask.bed
  • This file contains a set of regions that are excluded from automated gender-determination by default.

  • reference.svmask.fasta
  • A genome alignability mask (in indexed fasta format) that is used to deterimine whether reference sequence should be considered to be uniquely alignable.

    This mask is computed based on the reference genome and a default k-mer length and indicates whether a k-mer centered on each position of the reference genome has a unique mapping to the reference. The k-mer length is defaulted in each reference metadata bundle, but this mask can be overridden if necessary for specific data sets.

    This file is usually present in each reference metadata bundle as a symbolic link to a file whose name specifies the k-mer length used to compute the mask.

  • reference.gcmask.fasta
  • A genome mask (in indexed fasta format) that is used for gc-bias estimation in each sequencing library. This mask excludes region of the genome that have variable ploidy (e.g. sex chromosomes), regions that are in annotated segmental duplications, regions that have repeats flagged by repeat masker, regions annotated as containing polymorphic CNVs in DGV and miscellaneous reference contigs. In older versions of Genome STRiP, this file was called the "cn2mask" because it selects regions of the genome that are likely to be two copies in most people.

    There are also an accompanying index file (reference.gcmask.fasta.fai), a file with statistics about the number of masked and unmasked bases (reference.gcmask.fasta.stats), and for convenience a bed file with the equivalent mask information (reference.gcmask.fasta.bed).

  • reference.lcmask.fasta
  • A genome mask (in indexed fasta format) that masks low-complexity sequence and is used in read-depth analysis.

  • reference.rdmask.bed
  • A bed file specifying a set of regions to use for overall sequencing depth estimation. By default, this file contains all of the base autosomal sequence in the reference genome.

  • reference.vdjregions.bed
  • A bed file specifying a list of regions that are known to harbor somatic copy-number variation due to human V(D)J recombination. This file is used to flag such regions during CNV analysis.

  • reference.gcprof.zip
  • This file contains the distribution of windows across the reference genome stratified by G+C content. It is used for GC bias estimation during preprocessing.

  • reference.interval.list
  • File containing a list of intervals that is used by the CNV pipeline as the default list of intervals over which it will perform CNV discovery. If not present, all intervals in the reference genome will be used by default.

  • reference.dict
  • A picard-format dictionary file for the reference genome.

  • mdversion.txt
  • Used for version control, this file indicates the software compatibility of this reference metadata bundle.

File format for svmask, gcmask and rdmask fasta files

The mask files used by Genome STRiP can generally be provided in either fasta or bed format.

For the fasta format, the order and lengths of the sequences should correspond exactly to the order and lengths of the sequences in the corresponding reference fasta file. The entry at each base position is either 0 or 1, with 1 used to indicate positions which should be masked (skipped or omitted). For example, if the svmask is computed with an alignability threshold of 101bp, then the positions marked with 0 are in the center of unique kmers of length 101 on the reference that areare uniquely alignable to the core reference sequence (excluding alt contigs).

Building a custom reference metadata bundle

If you work with non-human organisms or use a human reference sequence for which there is no pre-computed reference metadata bundle, you will want to create most of the files present in a standard reference metadata bundle in order to use Genome STRiP.

In most cases, the software will try to use reasonable defaults when specific files are not present. For example, in the absence of a ploidy map, it is assumed that the entire reference genome is diploid in all individuals. The ploidy map (and the gender of each sample) are only used when calling variants on sex chromosomes. As another example, if a low-complexity mask is not available, CNV calling can still be performed, although the false discovery rate may be higher.

If you are using a human reference genome that is not listed, please let us know. We may be able to create a reference metadata bundle for your reference genome or give you advice on how to make simple modifications to a close-enough reference metadata bundle to get acceptable results.

If you do need to create your own reference metadata bundle, the following section gives some more specific guidance on how to create some of the more important files. It is assumed that you already have a reference genome in indexed fasta format.

  • reference.ploidymap.txt
  • This is a simple text file (whitespace delimited, no column headers). If this file is not provided, most of the code will assume a default ploidy of 2, although currently SVPreprocess requires a ploidy map.

    Currently, only organisms with a human-like XY sex determination system are fully supported. Here is an example based on human GRCh37:

    X  2699521  154931043  F  2
    X  2699521  154931043  M  1
    Y        1   59373566  F  0
    Y        1   59373566  M  1
    *        *          *  *  2
    

    The first column specifies a chromosome or sequence name. The next two columns are 1-based inclusive start and end positions of a region with uniform (possibly gender-dependent) ploidy. The fourth column indicates a gender (or * for any gender) and the fifth column gives the expected integral ploidy for the combination of region and gender. An asterisk in any column is interpreted as a wild card. Thus, the last line in the example indicates that all human chromosomes other than X and Y have an expected ploidy of 2. We do not currently try to handle mitochondrial DNA in the ploidy map.

  • reference.gendermask.bed
  • This is a bed file that is used to mask regions with unreliable read depth on human sex chromosomes. This is currently used only for automated gender estimation during preprocessing. You can optionally create such a mask manually to help with gender estimation if appropriate for your organism and reference genome.

  • reference.svmask.fasta
  • This alignability mask is computed based on the reference genome and a default k-mer length, using the Genome STRiP ComputeGenomeMask utility.

    Selecting a good k-mer length is currently a bit fiddly. The k-mer length used should be based on the shortest read length in the majority of your data set, although for human we have been satisfied with using a 100bp alignability mask for data sets with longer reads (e.g. 150 or 250bp). If most of your reads will be 100bp or longer, you can simply create and use a 100bp alignability mask. Using a slightly shorter k-mer mask (relative to your input data read lengths) is preferable to using one where the k-mer size is too long relative to most of the input data, however it also does not seem to matter much if only a small fraction of your input data has reads shorter than the target k-mer size.

  • reference.gcmask.fasta
  • If this file is not provided, GC-bias estimation will be performed using the entire genome. We have not investigated how important it is to filter the reference genome when estimating GC-bias.

    One tool you can use to help create this mask file is the BedFileToGenomeMask utility. You would first create a merged bed-file of all of the regions you want to exclude (regions of the genome known to have variable ploidy, sex chromosomes, miscellaneous contigs, perhaps annotations from repeat masker, etc.) and then convert the resulting bed file to a genome mask (fasta file) and index the fasta file.

  • reference.lcmask.fasta
  • A genome mask (in indexed fasta format) that masks low-complexity sequence and is used in read-depth analysis.

    This file is optional. For human, we created this mask using BedFileToGenomeMask based on the repeat masker track from the UCSC genome browser and selecting regions from categories "Low_complexity", "Satellite" or "Simple_repeat".

  • reference.rdmask.bed
  • A bed file specifying a set of regions to use for overall sequencing depth estimation. For human, we used all of the base autosomal sequence in the reference genome. You should create a bed file that excludes sex chromosomes or any other contigs known to have variable ploidy.

  • reference.gcprof.zip
  • It is not necessary to create this file in your reference metadata bundle. If it is not present, the SVPreprocess pipeline will create the same file locally in your dataset metadata directory.

  • reference.dict
  • A picard-format dictionary file for the reference genome. See the Picard documentation for the CreateSequenceDictionary utility.

  • mdversion.txt
  • Used for version control, you can copy this from any human reference metadata bundle.