This document covers the specifics of human genome reference assemblies. For more general information about reference genomes, including definitions of specialized terms used here, please see the Dictionary entry on Reference genomes. We recommend reading that article before tackling this one. For help dealing with reference compatibility problems, see this Solutions doc. For information on the FASTA format and accompanying index files, see the Dictionary entry on FASTA.
Successive "versions" of the human genome reference, commonly called assemblies or builds, have been published since the original draft Human Genome Project publication, bringing gradual improvements in quality made possible by technological advances, as well as improvements in the representativeness of the reference genome sequence with regard to historically underrepresented populations.
Arguably the most significant improvements have been made in the representation of so-called alternate haplotypes, i.e. regions that are sometimes dramatically different in different populations.
In a perfect world, the human reference genome should represent all of humanity faithfully. In practice, this is rather difficult to achieve due to the great diversity found in some parts of the human genome, compounded by historical bias in the selection of participants in genomic studies. As a result, populations whose genetic makeup is not commonly shared in European and North American nations have been historically underrepresented in the reference genome sequence. This has very real implications, including in terms of clinical outcomes, since our ability to identify meaningful variation in an individual's genome sequence is directly dependent on our ability to distinguish what is healthy from what might be pathological.
The latest build of the human reference genome, officially named GRCh38 (for Genome Reference Consortium human build 38) but commonly nicknamed hg38 (for human genome build 38), greatly expanded the repertoire of ALT contigs. These represent alternate haplotypes and have a significant impact on our power to detect and analyze genomic variation that is specific to populations that carry alternate haplotypes. Read on below for more details.
Although the corresponding improvements to the human reference genome have been overwhelmingly positive overall, the existence of these different builds has caused much confusion and plentiful errors over the years. The confusion is compounded by the advent of two parallel streams of reference genome evolution, named hg and b, which were published by different groups and differ in naming conventions, in some sequences, and in the non-canonical contigs they include. Part of the problem is that many bioinformatic tools fail to enforce consistent use of a specific reference. This allows the unwary user to switch reference genomes halfway through a project without realizing that their comparisons have suddenly become worthless, for example because positions are now shifted relative to the original coordinates. In contrast, toolkits such as GATK and Picard are almost painfully insistent on validating reference identity (via the sequence dictionary) before proceeding with analysis. This can admittedly be a source of great frustration, but it is a necessary safeguard that has saved many an analysis!
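To make the idea of validating reference identity via the sequence dictionary concrete, here is a minimal sketch. It assumes the dictionary is SAM-header-style text with tab-separated `@SQ` lines carrying `SN` (name) and `LN` (length) tags. The function names and the toy dictionaries are hypothetical, and this is not how GATK or Picard implement their checks.

```python
def parse_sequence_dictionary(text):
    """Return an ordered list of (name, length) pairs from @SQ header lines."""
    contigs = []
    for line in text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Fields after the record type are TAG:VALUE pairs separated by tabs.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        contigs.append((fields["SN"], int(fields["LN"])))
    return contigs


def dictionaries_match(dict_a, dict_b):
    """True only if both dictionaries list the same contigs, lengths and order."""
    return parse_sequence_dictionary(dict_a) == parse_sequence_dictionary(dict_b)


# Toy dictionaries: the lengths are placeholders, not real chromosome lengths.
hg_style = "@HD\tVN:1.5\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chrM\tLN:50\n"
b_style = "@HD\tVN:1.5\n@SQ\tSN:1\tLN:1000\n@SQ\tSN:MT\tLN:50\n"

print(dictionaries_match(hg_style, hg_style))
print(dictionaries_match(hg_style, b_style))
```

A strict equality check like this one is deliberately unforgiving: any difference in contig name, length or order counts as a mismatch, which is the behavior you want before comparing results across files.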
GRCh38/hg38 is the assembly of the human genome released in December 2013. It uses alternate (ALT) contigs to represent common complex variation, including HLA loci. Alternate contigs were also present in past assemblies, but not to the extent we see with GRCh38. Many of the improvements in GRCh38 are the result of other genome sequencing and analysis projects, including the 1000 Genomes Project.
We strongly recommend switching to GRCh38/hg38 if you are working with human sequence data. In addition to adding many alternate contigs, GRCh38 corrects thousands of small sequencing artifacts that cause false SNPs and indels to be called when using the GRCh37 assembly (b37/hg19). It also includes synthetic centromeric sequence and updates non-nuclear genomic sequence.
The ideogram is from the Genome Reference Consortium website and showcases GRCh38.p7. The zoomed region illustrates how regions in blue are full of Ns.
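The GRCh38 primary assembly contigs follow the chr naming convention: the autosomes (chr1 through chr22), X (chrX), Y (chrY) and Mitochondrial (chrM).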
The GRCh38 ALT contigs are recognizable by their _alt suffix; they amount to a total of 109 Mb in length and span 60 Mb of the primary assembly. Alternate contig sequences range from novel, to highly diverged, to nearly identical relative to the corresponding primary assembly sequence. Sequences that are highly diverged from the primary assembly contribute only a few million bases. Most subsequences of ALT contigs are fairly similar to the primary assembly. This means that if we align sequence reads to GRCh38+ALT blindly, we obtain many multi-mapping reads with zero mapping quality. Since many GATK tools have a ZeroMappingQuality filter, we will then miss variants corresponding to such loci.
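Since ALT contigs are recognizable by name, separating them from the rest of a contig list takes only a few lines. The sketch below relies solely on the _alt suffix mentioned above. The function name is hypothetical and the contig names are illustrative examples in the hg38 style.

```python
def split_alt_contigs(contig_names):
    """Partition contig names into (alt, other) based on the _alt suffix."""
    alt = [name for name in contig_names if name.endswith("_alt")]
    other = [name for name in contig_names if not name.endswith("_alt")]
    return alt, other


# Illustrative contig names; a real list would come from the sequence dictionary.
names = ["chr1", "chr6", "chr6_GL000250v2_alt", "chrX", "chr1_KI270762v1_alt"]
alt, other = split_alt_contigs(names)
print("ALT contigs:", alt)
print("Other contigs:", other)
```

This only identifies the ALT contigs. It does not make read mapping ALT-aware, which requires the dedicated workflow described in the tutorial.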
Tutorial#8017 outlines how to map reads in an alternate contig aware manner and discusses some of the implications of mapping reads to reference genomes with alternate contigs.
Together, the Pseudo-Autosomal Region (PAR) sequences on X and Y essentially create a diploid region, so they are intentionally made identical in the genome assembly. In the analysis set version of the genome, the two Y chromosome PAR regions are hard-masked, i.e. replaced with Ns, so that reads map solely to the X chromosome PAR regions. The chrY locations of PAR1 and PAR2 on GRCh38 are chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415. In the IGV-based figure below, you can see a section of the chrY PAR1 that is hard-masked in the analysis set genome.
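As a small illustration of how these coordinates might be used, the following sketch reports whether a chrY position falls inside one of the PAR intervals quoted above. Treating both interval ends as inclusive is an assumption, and the constant and function names are hypothetical.

```python
# chrY PAR intervals on GRCh38 as quoted in the text.
# Assumption: both ends are treated as inclusive.
CHRY_PAR_GRCH38 = {
    "PAR1": (10_000, 2_781_479),
    "PAR2": (56_887_902, 57_217_415),
}


def chry_par_region(position):
    """Return 'PAR1' or 'PAR2' if the chrY position lies in a PAR, else None."""
    for name, (start, end) in CHRY_PAR_GRCH38.items():
        if start <= position <= end:
            return name
    return None


for pos in (1_000_000, 20_000_000, 57_000_000):
    print(pos, chry_par_region(pos))
```

A check like this can help explain why reads are absent from those chrY positions in data aligned to the analysis set: the sequence there is masked, so the reads are placed on chrX instead.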
The sequence in the reference set is a mix of uppercase and lowercase letters. The lowercase letters represent soft-masked sequence corresponding to repeats from RepeatMasker and Tandem Repeats Finder.
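The case convention can be inspected directly from the sequence. The sketch below tallies uppercase bases, lowercase (soft-masked) bases, and N bases in a string. The function name and the toy sequence are hypothetical. Note that Ns can stand for assembly gaps as well as hard-masked sequence, so they are counted in a neutral category.

```python
def masking_summary(sequence):
    """Count unmasked (uppercase), soft-masked (lowercase) and N bases."""
    counts = {"unmasked": 0, "soft_masked": 0, "n_bases": 0}
    for base in sequence:
        if base in "Nn":
            counts["n_bases"] += 1
        elif base.islower():
            counts["soft_masked"] += 1
        else:
            counts["unmasked"] += 1
    return counts


# Toy sequence: 4 uppercase bases, 4 soft-masked bases, 4 Ns.
print(masking_summary("ACGTacgtNNNN"))
```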
Some additional regions on chromosomes 5, 14, 19, 21 and 22 that feature homologous centromeric and genomic repeat arrays are also hard-masked in the analysis set genome.
The GRCh38 analysis set also includes a contig to siphon off reads corresponding to the Epstein-Barr virus sequence, as well as decoy contigs. The EBV contig can help correct for artifacts stemming from immortalization of human blood lymphocytes with EBV transformation. It also captures endogenous EBV sequence since EBV naturally infects B cells in ~90% of the world population.
Patches are intended to improve representation or add information to the assembly without disrupting the chromosome coordinates. The naming convention is as follows: GRCh38.p7 indicates the seventh patched minor release of GRCh38. Note that the GATK team rarely if ever adopts patches due to constraints from our production operations. We are not currently able to provide support for the use of patches.
For these builds (hg19 and b37), the primary assembly coordinates are identical in the original releases, but the patch updates differ. In addition, the naming conventions of the references differ, e.g. the use of chr1 (in hg19) versus 1 (in b37) to indicate chromosome 1, and chrM vs. MT for the mitochondrial genome. The included decoys also differ. So it is possible to lift over resources from one to the other, but it should be done using Picard LiftoverVcf with the appropriate chain files. Trying to convert between them just by renaming contigs is a bad idea. And in the case of BAMs, well, the bad news is that if you have a BAM aligned to one reference build but you need the other, you'll have to re-map the data from scratch.
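Before combining files, it can help to check which naming convention each one follows. The sketch below guesses the convention from a list of contig names using only the chr1 versus 1 distinction described above. The function name and return labels are hypothetical. It only detects a likely mismatch and does nothing to make contig renaming safe.

```python
def guess_naming_convention(contig_names):
    """Guess 'hg-style' (chr1) or 'b-style' (1) from contig names."""
    names = set(contig_names)
    if "chr1" in names:
        return "hg-style"
    if "1" in names:
        return "b-style"
    return "unknown"


vcf_contigs = ["1", "2", "X", "MT"]            # e.g. from a b37-based file
reference_contigs = ["chr1", "chr2", "chrX", "chrM"]  # e.g. from an hg19 FASTA

if guess_naming_convention(vcf_contigs) != guess_naming_convention(reference_contigs):
    print("Naming conventions differ: lift over or re-map, do not just rename.")
```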
These assemblies had major differences and have largely been abandoned. We no longer support their use with GATK. If you have old data derived from these references you should really consider reprocessing everything.
Several key steps in the GATK Best Practices workflows require truth sets, known variants etc. that are derived from the reference you're using. We make sets of suitable resources available for the supported reference builds. If you're working on your own installation of GATK, you can get these from the Resource Bundle. If you're using GATK on FireCloud, our cloud-based analysis platform, the featured GATK workspaces are preloaded with the appropriate resources.
The UCSC Genome Browser allows browsing and download of genomes, including analysis sets, from many different species. For more details on the difference between GRCh38 reference and analysis sets, see ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/README.txt. In addition, the site provides annotation files, such as the annotation database for GRCh38. Within that database, the file named gap.txt.gz catalogues the gapped regions of the assembly, which are full of Ns. For our illustration above, the corresponding region in this file shows:
585 chr14 0 10000 1 N 10000 telomere no
1 chr14 10000 16000000 2 N 15990000 short_arm no
707 chr14 16022537 16022637 4 N 100 contig no
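The rows shown above can be read programmatically. The following sketch parses them into named records, assuming the columns follow the UCSC gap table layout (bin, chrom, chromStart, chromEnd, ix, n, size, type, bridge). The record and function names are hypothetical.

```python
from collections import namedtuple

# Assumed column layout of the UCSC gap table.
GapRecord = namedtuple(
    "GapRecord",
    "bin chrom chrom_start chrom_end ix n size gap_type bridge",
)


def parse_gap_line(line):
    """Parse one whitespace-separated gap.txt row into a GapRecord."""
    f = line.split()
    return GapRecord(int(f[0]), f[1], int(f[2]), int(f[3]),
                     int(f[4]), f[5], int(f[6]), f[7], f[8])


rows = [
    "585 chr14 0 10000 1 N 10000 telomere no",
    "1 chr14 10000 16000000 2 N 15990000 short_arm no",
    "707 chr14 16022537 16022637 4 N 100 contig no",
]
records = [parse_gap_line(row) for row in rows]

for rec in records:
    # For these rows, the size column equals chrom_end - chrom_start.
    print(rec.chrom, rec.gap_type, rec.chrom_end - rec.chrom_start)
```

Reading the rows this way makes it clear that the first 16 Mb of chr14 in this build consist of a telomere gap followed by a short-arm gap, which is the N-filled region highlighted in the illustration.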