MakeReference Manual (v1.0)

MakeReference is companion software to SNP2HLA. Given SNP genotype data and typed HLA allele data for a set of reference individuals, it builds a reference panel that SNP2HLA can subsequently use.


1. Download PLINK (v1.07) for your platform. Copy the "plink" executable into the current directory (alongside MakeReference.csh).
2. Download the Beagle (version 3.0.4) .jar files into the current directory: "beagle.jar", "beagle2linkage.jar", and "linkage2beagle.jar". We recommend version 3.0.4 for compatibility reasons, even though it is not the newest version. The Beagle web page includes links to the binaries of all past versions.

Files in the package

The MakeReference package can be downloaded from the SNP2HLA main homepage.
1. MakeReference.csh: Creates a reference panel using SNP and HLA data.
2. Generates amino acid or intragenic SNP variants given HLA alleles
3. Encodes HLA alleles into binary markers (for imputation with posterior probabilities using Beagle)
4. Encodes genetic positions for polymorphic HLA variants
5. Sample HapMap CEU dataset (n=180) with HLA alleles
6. HLA amino acid / protein sequence dictionary:
7. HLA intragenic SNPs / DNA sequence dictionary:


Input Files

1. SNP dataset (.bed/.bim/.fam PLINK format)
2. HLA types for SNP dataset individuals

Format of HLA types PED file

It is important to format the PED file correctly.
The HLA PED file contains the 4-digit HLA alleles, with columns in this order: FID, IID, pID, mID, SEX, PHENO, A, B, C, DPA1, DPB1, DQA1, DQB1, DRB1.
The HLA genes must appear in the following (alphabetical) order:

  HLA-A, B, C, DPA1, DPB1, DQA1, DQB1, DRB1  

Note that each HLA gene spans two columns per individual (one allele per chromosome).
Put "0" for unknown HLA alleles.
For example, a row of the HLA PED file looks like this:
1334    NA10847 NA12146 NA12239 2       0       2501    0301    0801    1801    0701    1203    0       0       0      0       0101    0101    0501    0501    0101    0101
This person's HLA-A types are HLA-A*25:01 and HLA-A*03:01 (7th and 8th columns)
This person's HLA-DRB1 types are HLA-DRB1*01:01 and HLA-DRB1*01:01 (the last two columns)
This person's HLA-DPA1 and HLA-DPB1 types are unknown (coded as "0")
See our example file HAPMAP_CEU_HLA.ped for reference.
If you have only 2-digit information, you can enter the 2-digit allele instead of the 4-digit allele. (For example, if this person has only 2-digit information for HLA-DRB1, you can write "01" instead of "0101".)
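Because a misformatted row is easy to miss, it can help to check the HLA PED file before running MakeReference. The following is a minimal Python sketch, not part of the MakeReference package. The function name `parse_hla_ped_line` is hypothetical, and the rule that every allele code is "0", two digits, or four digits is an assumption drawn from the examples above; relax it if your data uses other codes.

```python
import re

# Gene order required by the manual; each gene occupies two columns.
HLA_GENES = ["A", "B", "C", "DPA1", "DPB1", "DQA1", "DQB1", "DRB1"]
N_HEADER = 6  # FID, IID, pID, mID, SEX, PHENO


def parse_hla_ped_line(line):
    """Return (FID, IID, {gene: (allele1, allele2)}) or raise ValueError."""
    fields = line.split()
    expected = N_HEADER + 2 * len(HLA_GENES)  # 6 + 16 = 22 columns
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, found {len(fields)}")
    alleles = fields[N_HEADER:]
    genotypes = {}
    for i, gene in enumerate(HLA_GENES):
        pair = (alleles[2 * i], alleles[2 * i + 1])
        for allele in pair:
            # Assumption: "0" (unknown), 2-digit, or 4-digit codes only.
            if not re.fullmatch(r"0|\d{2}|\d{4}", allele):
                raise ValueError(f"HLA-{gene}: unexpected allele code {allele!r}")
        genotypes[gene] = pair
    return fields[0], fields[1], genotypes


# The example row from this manual.
row = ("1334 NA10847 NA12146 NA12239 2 0 2501 0301 0801 1801 0701 1203 "
       "0 0 0 0 0101 0101 0501 0501 0101 0101")
fid, iid, genotypes = parse_hla_ped_line(row)
print(iid, genotypes["A"])  # prints: NA10847 ('2501', '0301')
```

Applying the function to every line of the PED file and reporting the line number of any `ValueError` is usually enough to locate a shifted or missing column.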

Running command

      ./MakeReference.csh DATA (.bed/.bim/.fam) HLA_TYPE_DATA (.ped) OUTPUT plink 


Run MakeReference with sample data provided (HapMap CEU dataset) using the following command:

./MakeReference.csh HAPMAP_CEU HAPMAP_CEU_HLA.ped HM_CEU_REF plink

In the above example,
- HAPMAP_CEU is the prefix of the SNP genotype PLINK files (.bed/.bim/.fam),
- HAPMAP_CEU_HLA.ped is the file of HLA types for the SNP dataset individuals,
- HM_CEU_REF is the prefix of the output reference dataset (.bgl.phased/.markers),
- plink is the path to the PLINK executable.
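Before launching the script, it may be worth confirming that every file mentioned so far is in place. The sketch below is an illustration only: the helper names `required_files` and `missing_inputs` and the `workdir` parameter are hypothetical, and it assumes that PLINK and the Beagle .jar files were copied into the same directory as MakeReference.csh, as described in the setup steps.

```python
from pathlib import Path


def required_files(data_prefix, hla_ped):
    """List the files this manual says must be present for a run."""
    files = [f"{data_prefix}{ext}" for ext in (".bed", ".bim", ".fam")]
    files.append(hla_ped)
    # Script and dependencies, assumed to sit in the working directory.
    files += ["MakeReference.csh", "plink", "beagle.jar",
              "beagle2linkage.jar", "linkage2beagle.jar"]
    return files


def missing_inputs(data_prefix, hla_ped, workdir="."):
    """Return the required files that do not exist under workdir."""
    base = Path(workdir)
    return [name for name in required_files(data_prefix, hla_ped)
            if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_inputs("HAPMAP_CEU", "HAPMAP_CEU_HLA.ped")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All inputs found.")
```

If PLINK lives elsewhere and you pass its path as the fourth argument, drop "plink" from the list in `required_files`.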

Output Files

1. New reference panel in PLINK format, containing SNPs, HLA alleles, HLA amino acids, and HLA intragenic SNPs (.bed/.bim/.fam)
2. Allele frequencies for all variants in the reference panel (.FRQ.frq)
3. Beagle format phased haplotypes (.bgl.phased)
4. A file denoting marker positions and order (.markers) for subsequent imputation
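As a quick sanity check on a new panel, the frequency file can be summarized per classical HLA allele. The sketch below rests on two assumptions that are not stated in this manual: that the .FRQ.frq file uses the standard PLINK `--freq` layout (columns CHR, SNP, A1, A2, MAF, NCHROBS, where MAF is the frequency of A1), and that binary markers carry the alleles P (present) and A (absent) in the A1/A2 columns. The function name `presence_frequencies` is hypothetical.

```python
def presence_frequencies(frq_lines, prefix="HLA_"):
    """Map marker name -> frequency of the 'P' (present) allele.

    frq_lines: iterable of text lines from a PLINK .frq file.
    Only markers whose names start with `prefix` are reported.
    """
    freqs = {}
    for line in frq_lines:
        fields = line.split()
        if len(fields) < 6 or fields[0] == "CHR":
            continue  # skip the header and malformed lines
        snp, a1, a2, maf = fields[1], fields[2], fields[3], fields[4]
        if not snp.startswith(prefix) or maf == "NA":
            continue
        maf = float(maf)
        if a1 == "P":        # 'present' is the allele counted by MAF
            freqs[snp] = maf
        elif a2 == "P":      # 'present' is the other allele
            freqs[snp] = 1.0 - maf
    return freqs


if __name__ == "__main__":
    # Output prefix taken from the example run in this manual.
    with open("HM_CEU_REF.FRQ.frq") as handle:
        for name, freq in sorted(presence_frequencies(handle).items()):
            print(f"{name}\t{freq:.4f}")
```

Passing `prefix="AA_"` or `prefix="SNP_"` gives the same summary for amino acid or intragenic SNP markers, though multi-character allele markers that are not coded P/A are skipped by this sketch.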

Marker Nomenclature

For binary encodings, P = Present, A = Absent.

1. Classical HLA alleles: HLA_[GENE]_[ALLELE].
- HLA_C_0304 = HLA-C*03:04 (four-digit allele)
- HLA_DRB1_07 = HLA-DRB1*07 (two-digit allele)

2. HLA amino acids: AA_[GENE]_[AMINO ACID POSITION]_[GENETIC POSITION]_[ALLELE]
- AA_A_56_30018678_G = amino acid 56 of HLA-A, genetic position 30018678 (center of codon), allele = G (Gly) of multi-allelic position
- AA_C_291_31345793 = amino acid 291 of HLA-C, genetic position 31345793, bi-allelic (check {OUTPUT}.bim for alleles, P = Present, A = Absent)

3. HLA intragenic SNPs: SNP_[GENE]_[POSITION]_[ALLELE]
- SNP_B_31430319_G = SNP at position 31430319 of HLA-B, allele = G (guanine) of multi-allelic position
- SNP_DRB1_32659974 = SNP at position 32659974 of HLA-DRB1, bi-allelic (check {OUTPUT}.bim for alleles, P = Present, A = Absent)
- SNP_DQB1_32740666_AT = SNP at position 32740666 of HLA-DQB1, alleles = A (adenine) or T (thymine), (check {OUTPUT}.bim for alleles, P = Present, A = Absent)

4. Insertions / deletions: [VARIANT]_[GENE]_[POSITION]_[INSERTION/x=DELETION]
- AA_C_339_31345102_x = deletion at amino acid 339 in HLA-C, genetic position 31345102 (center of codon), (check {OUTPUT}.bim for alleles, P = deletion Present, A = deletion Absent)
- INS_C_295x296_31345779_VLAVLA = insertion between amino acids 295 and 296 of HLA-C, amino acid sequence inserted = VLAVLA, (check {OUTPUT}.bim for alleles, P = insertion Present, A = insertion Absent)
- SNP_DQA1_32717217_x = deletion at genetic position 32717217 of HLA-DQA1, (check {OUTPUT}.bim for alleles, P = deletion Present, A = deletion Absent)
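The naming scheme above is regular enough to be split programmatically, which is handy when filtering imputation results by gene or variant type. The following Python sketch is an illustration based only on the example names in this section. The function name `parse_marker` and the returned dictionary keys are hypothetical, and names that do not fit the scheme (for example ordinary SNP identifiers) simply return `None`.

```python
def parse_marker(name):
    """Split a MakeReference marker name into its parts.

    Returns a dict, or None if the name does not follow the
    HLA_/AA_/INS_/SNP_ conventions described in this manual.
    An allele of "x" denotes a deletion; None means the alleles
    must be looked up in the .bim file.
    """
    parts = name.split("_")
    kind = parts[0]
    if kind == "HLA" and len(parts) == 3:
        return {"kind": kind, "gene": parts[1], "allele": parts[2]}
    if kind in ("AA", "INS") and len(parts) in (4, 5) and parts[3].isdigit():
        return {
            "kind": kind,
            "gene": parts[1],
            "aa_position": parts[2],   # kept as text: may be "295x296"
            "position": int(parts[3]),
            "allele": parts[4] if len(parts) == 5 else None,
        }
    if kind == "SNP" and len(parts) in (3, 4) and parts[2].isdigit():
        return {
            "kind": kind,
            "gene": parts[1],
            "position": int(parts[2]),
            "allele": parts[3] if len(parts) == 4 else None,
        }
    return None


for example in ("HLA_C_0304", "AA_A_56_30018678_G", "SNP_DRB1_32659974",
                "INS_C_295x296_31345779_VLAVLA", "SNP_DQA1_32717217_x"):
    print(example, "->", parse_marker(example))
```

Keeping the amino acid position as text rather than an integer is a deliberate choice here, since insertion markers encode a pair of flanking positions.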


Please email questions to the SNP2HLA team at snp2hla at broad . mit . edu


Xiaoming Jia*, Buhm Han*, Suna Onengut-Gumuscu, Wei-Min Chen, Patrick J. Concannon, Stephen S. Rich, Soumya Raychaudhuri, Paul I.W. de Bakker. "Imputing Amino Acid Polymorphisms in Human Leukocyte Antigens." PLoS One. In press. 2013.
