SNP2HLA Manual (v1.0)

-- SNP2HLA imputes amino acids, HLA alleles, and SNPs in the MHC region from SNP genotype data.

Prerequisites

1. Download PLINK (v1.07) for your platform. Copy the "plink" executable into the current directory (the directory containing SNP2HLA.csh).
2. Download the Beagle (version 3.0.4) .jar files into the current directory, including "beagle.jar".
We recommend version 3.0.4 for compatibility reasons, even though it is not the newest version.
The Beagle web page includes links to the binaries of all past versions.
The zip file includes "linkage2beagle.jar" in the "utility" directory. Copy this file to the current directory as well.
3. Download "beagle2linkage.jar" and copy it to the current directory.
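
Before running SNP2HLA, it can help to confirm that the files listed above are present in the working directory. The following is a minimal Python sketch, not part of the SNP2HLA package. The file names are taken from the list above, and the function name missing_prerequisites is a hypothetical helper introduced for illustration.

```python
from pathlib import Path

# File names taken from the prerequisite list above.
REQUIRED_FILES = [
    "SNP2HLA.csh",
    "plink",
    "beagle.jar",
    "linkage2beagle.jar",
    "beagle2linkage.jar",
]


def missing_prerequisites(directory="."):
    """Return the required files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Missing from current directory:", ", ".join(missing))
    else:
        print("All prerequisite files found.")
```

An empty list means that every file was found. The check does not verify the PLINK or Beagle versions.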

Files in the package

<< Download SNP2HLA at the main homepage
1. SNP2HLA.csh: Performs imputation (via Beagle) after SNP QC (using PLINK)
2. Merge_tables.pl: Merges files according to indices in a particular column (called by SNP2HLA.csh)
3. ParseDosage.csh: Converts .gprobs (Beagle) file to .dos (Dosage) file
4. HapMap CEU reference dataset (Plink and Beagle formats)
5. Sample SNP dataset of 10 individuals from the British 1958 Birth Cohort (1958BC.bed/.bim/.fam)

Input

1. SNP dataset (.bed/.bim/.fam PLINK format)
*** SNPs are matched to the reference by rsID, so the coordinate build of the input (hg18/hg19) is not important. ***
2. Reference dataset (.bgl.phased/.markers Beagle format)

Running command

      ./SNP2HLA.csh DATA (.bed/.bim/.fam) REFERENCE (.bgl.phased/.markers) OUTPUT plink {optional: max_memory[mb] window_size}

Example

Run SNP2HLA with sample data provided (10 samples from British 1958 Birth Cohort, HapMap CEU reference dataset) using the following command:

./SNP2HLA.csh 1958BC HM_CEU_REF 1958BC_IMPUTED plink 2000 1000

In the above example,
- 1958BC is the prefix of the SNP genotype PLINK files (.bed/.bim/.fam),
- HM_CEU_REF is the prefix of the reference dataset (.bgl.phased/.markers),
- 1958BC_IMPUTED is the prefix of the output files,
- plink is the path to the PLINK executable,
- 2000 is the maximum Java heap size (in mb) for imputation using Beagle (increase as needed),
- 1000 is the marker window size that Beagle uses for phasing and imputation.

SNP2HLA also runs with default parameters if memory and window size are not specified (Java memory = 2Gb, marker window size = 1000), for example:
./SNP2HLA.csh 1958BC HM_CEU_REF 1958BC_IMPUTED plink
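
When SNP2HLA is launched from a script, the argument order above can be assembled programmatically. The following Python sketch is an illustration only. The function names build_snp2hla_command and missing_inputs are hypothetical, and the rule that window_size cannot be given without max_memory is an assumption based on the arguments being positional.

```python
from pathlib import Path


def build_snp2hla_command(data, reference, output, plink="plink",
                          max_memory_mb=None, window_size=None):
    """Assemble the SNP2HLA.csh argument list in the documented order."""
    if window_size is not None and max_memory_mb is None:
        # Assumption: arguments are positional, so window_size must follow max_memory.
        raise ValueError("window_size requires max_memory_mb to be set")
    cmd = ["./SNP2HLA.csh", data, reference, output, plink]
    if max_memory_mb is not None:
        cmd.append(str(max_memory_mb))
    if window_size is not None:
        cmd.append(str(window_size))
    return cmd


def missing_inputs(data, reference):
    """List the expected input files that do not exist."""
    expected = [f"{data}{ext}" for ext in (".bed", ".bim", ".fam")]
    expected += [f"{reference}{ext}" for ext in (".bgl.phased", ".markers")]
    return [path for path in expected if not Path(path).is_file()]
```

The returned list can be passed to subprocess.run(cmd, check=True) from the directory that contains SNP2HLA.csh, after missing_inputs has returned an empty list.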

Output Files

{OUTPUT}.dosage: Imputed allele dosage data (recommended for downstream analysis)
Beagle dosage format -- Rows are markers and columns are individuals, one column per individual, where the individuals are in the same order as in {OUTPUT}.bgl.phased below
{OUTPUT}.bgl.phased: imputation best-guess genotypes
Beagle phased output format -- Rows are markers and columns are individuals, two columns per individual.
{OUTPUT}[.bed/.bim/.fam/.ped/.map]: PLINK format best-guess genotype files
{OUTPUT}.bgl.gprobs: imputation posterior probabilities for SNPs, HLA alleles, and HLA amino acids
{OUTPUT}.bgl.r2: predicted r2 between the imputed and true genotypes
*** Output coordinates are all in hg18 currently. ***
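
The dosage and r2 files can be read with a few lines of Python for custom downstream checks. The sketch below rests on two assumptions that should be verified against your own output: that each {OUTPUT}.dosage row holds the marker name, two allele codes, and then one dosage per individual, and that each {OUTPUT}.bgl.r2 row holds a marker name followed by its r2 value. The function names are hypothetical, and the threshold is a user choice, not a recommendation of this manual.

```python
def read_dosage(lines):
    """Parse dosage rows into {marker: (allele1, allele2, [dosages])}.

    Assumed layout: marker, allele1, allele2, then one value per individual.
    """
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete rows
        marker, allele1, allele2 = fields[:3]
        table[marker] = (allele1, allele2, [float(x) for x in fields[3:]])
    return table


def read_r2(lines):
    """Parse r2 rows into {marker: r2}. Assumed layout: marker, r2."""
    values = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            values[fields[0]] = float(fields[1])
    return values


def markers_above(r2_values, threshold):
    """Return markers whose r2 is at least `threshold` (NaN never passes)."""
    return [marker for marker, r2 in r2_values.items() if r2 >= threshold]
```

Both readers accept any iterable of lines, so an open file handle can be passed directly, for example read_dosage(open("1958BC_IMPUTED.dosage")).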

Association testing in PLINK

      plink --noweb --dosage OUTPUT.dosage noheader format=1 --fam OUTPUT.fam --logistic --out OUTPUT.assoc

Marker Nomenclature

For binary encodings, P = Present, A = Absent.

1. Classical HLA alleles: HLA_[GENE]_[ALLELE].
- HLA_C_0304 = HLA-C:03:04 (four-digit allele)
- HLA_DRB1_07 = HLA-DRB1:07 (two-digit allele)

2. HLA Amino Acids: AA_[GENE]_[AMINO ACID POSITION]_[GENETIC POSITION]_[ALLELE].
- AA_A_56_30018678_G = amino acid 56 of HLA-A, genetic position 30018678 (center of codon), allele = G (Gly) of multi-allelic position
- AA_C_291_31345793 = amino acid 291 of HLA-C, genetic position 31345793, bi-allelic (check {OUTPUT}.bim for alleles, P = Present, A = Absent)

3. HLA intragenic SNPs: SNP_[GENE]_[POSITION]_[ALLELE]
- SNP_B_31430319_G = SNP at position 31430319 of HLA-B, allele = G (guanine) of multi-allelic position
- SNP_DRB1_32659974 = SNP at position 32659974 of HLA-DRB1, bi-allelic (check {OUTPUT}.bim for alleles, P = Present, A = Absent)
- SNP_DQB1_32740666_AT = SNP at position 32740666 of HLA-DQB1, alleles = A (adenine) or T (thymine), (check {OUTPUT}.bim for alleles, P = Present, A = Absent)

4. Insertions / deletions: [VARIANT]_[GENE]_[POSITION]_[INSERTION/x=DELETION]
- AA_C_339_31345102_x = deletion at amino acid 339 in HLA-C, genetic position 31345102 (center of codon), (check {OUTPUT}.bim for alleles, P = deletion Present, A = deletion Absent)
- INS_C_295x296_31345779_VLAVLA = insertion between amino acids 295 and 296 of HLA-C, amino acid sequence inserted = VLAVLA, (check {OUTPUT}.bim for alleles, P = insertion Present, A = insertion Absent)
- SNP_DQA1_32717217_x = deletion at genetic position 32717217 of HLA-DQA1, (check {OUTPUT}.bim for alleles, P = deletion Present, A = deletion Absent)
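
The naming scheme above can be split into its parts mechanically, which is convenient when grouping results by gene or marker type. The following Python sketch handles only the patterns documented in this section and labels every other name (for example, an ordinary rsID) as "other". The function name parse_marker and the dictionary keys are hypothetical choices made for illustration.

```python
def parse_marker(name):
    """Split a marker name into the components documented above."""
    parts = name.split("_")
    kind = parts[0]
    if kind == "HLA" and len(parts) == 3:
        gene, allele = parts[1], parts[2]
        # 4 characters = four-digit allele, 2 characters = two-digit allele.
        return {"type": "HLA", "gene": gene, "allele": allele,
                "digits": len(allele)}
    if kind in ("AA", "INS") and len(parts) in (4, 5) and parts[3].isdigit():
        # aa_position stays a string because insertions use forms like 295x296.
        return {"type": kind, "gene": parts[1], "aa_position": parts[2],
                "position": int(parts[3]),
                "allele": parts[4] if len(parts) == 5 else None}
    if kind == "SNP" and len(parts) in (3, 4) and parts[2].isdigit():
        return {"type": "SNP", "gene": parts[1], "position": int(parts[2]),
                "allele": parts[3] if len(parts) == 4 else None}
    return {"type": "other", "name": name}
```

An allele of None marks a bi-allelic marker whose alleles must be looked up in {OUTPUT}.bim, and an allele of "x" marks a deletion.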

Reference datasets

The T1DGC reference panel has been removed from the package as of V1.0.3 due to security issues relating to individual-level genotype data (3/10/15). If you are a researcher interested in obtaining access to this reference panel, please contact the NIDDK Central Repository (https://repository.niddk.nih.gov/studies/t1dgc-special/?query=snp2hla)
Larger reference panels facilitate more accurate imputation. The Type 1 Diabetes Genetics Consortium (T1DGC) collected a high-quality HLA reference panel with a large sample size (N > 5,000), which was evaluated in our publication (Jia, Han, et al PLoS One 2013). The T1DGC reference panel was previously distributed as part of the package.
The Pan-Asian reference panel is available as part of the package. (V1.0.2, 7/10/14)

Frequently Asked Questions

1. Why does SNP2HLA encode so many different types of markers?
2. What kind of analyses can I do with the output?
3. Some SNPs in my original data are missing in the output.
4. Are there computational requirements?
5. SNP2HLA crashed.
6. SNP2HLA still crashes even if I increase memory.
7. I have additional questions. Who can I contact?
8. Do you have any publication describing the method?

1. Why does SNP2HLA encode so many different types of markers?

SNP2HLA encodes different types of markers: (1) binary markers for HLA alleles, (2) binary markers for the presence/absence of a specific amino acid residue, (3) binary markers for the presence/absence of a SET of amino acid residues at a multi-allelic amino acid position, (4) HLA intragenic SNPs, and (5) binary markers for insertions/deletions, as described in Marker Nomenclature above. The goal is to minimize prior assumptions about which types of variation are causal and to test all types of variation simultaneously in an unbiased fashion. However, users are always free to restrict analyses to specific marker subsets.

2. What kind of analyses can I do with the output?

SNP2HLA output supports many types of downstream analysis. First, you can test all defined markers (regardless of type: HLA alleles, amino acids, SNPs) to find the top statistical peak. (We recommend using the dosage output rather than the best-guess genotypes, to account for imputation uncertainty.) It is informative to see whether the peak comes from an HLA allele, an amino acid, or a SNP. If a promising amino acid position emerges, you can also test that position (typically one with multi-allelic residues) with an omnibus test, to examine whether it remains the top signal within the HLA gene. You can also run a conditional analysis, in which you condition on one marker and look for the next independent peak. Since SNP2HLA provides phased haplotypes, haplotype analysis is also possible. For examples, please refer to Pereyra et al. (Science 2010) and Raychaudhuri et al. (Nature Genetics 2012).

3. Some SNPs in my original data are missing in the output.

This is because SNP2HLA excludes SNPs that are not in the reference, since they would not help the imputation. This does not usually affect the analysis results, because after imputation the top signal should come from within the HLA genes, whereas the SNPs in the pre-imputation data should lie outside the HLA genes. However, if desired, you can merge the result files (e.g. in PLINK format) with the original SNP data before analysis.
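
To see in advance which input SNPs have no match in the reference, the SNP IDs in the .bim file can be compared with the marker IDs in the reference .markers file. The Python sketch below assumes the standard layouts (SNP ID in the second column of .bim, marker ID in the first column of .markers). The function names are hypothetical. Since SNP2HLA.csh also applies SNP QC with PLINK, SNPs can be dropped for other reasons as well, so this list is not necessarily complete.

```python
def bim_ids(lines):
    """SNP IDs from PLINK .bim rows (second column), in file order."""
    rows = (line.split() for line in lines)
    return [fields[1] for fields in rows if len(fields) >= 2]


def marker_ids(lines):
    """Marker IDs from Beagle .markers rows (first column)."""
    rows = (line.split() for line in lines)
    return {fields[0] for fields in rows if fields}


def snps_not_in_reference(bim_lines, marker_lines):
    """Return input SNP IDs that have no matching ID in the reference."""
    reference = marker_ids(marker_lines)
    return [snp for snp in bim_ids(bim_lines) if snp not in reference]
```

For the sample data this would be called as snps_not_in_reference(open("1958BC.bim"), open("HM_CEU_REF.markers")).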

4. Are there computational requirements?

Depending on the sample size, SNP2HLA may require a large amount of memory. What matters is the total number of individuals in the reference dataset plus the sample. For example, imputing 7,000 individuals required 10Gb of memory in our trial runs. Note that for large datasets you should specify the memory size in the running command.

5. SNP2HLA crashed.

Like any other software, SNP2HLA can behave unexpectedly. For example, the internal Beagle run may crash because of insufficient memory. If your dataset (reference + sample) contains more than 5,000 individuals, try increasing the memory size (e.g. to 10Gb) using the max_memory[mb] option in the running command.
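
The total number of individuals (reference + sample) can be counted from the input files before starting a run. The Python sketch below assumes one individual per row in the .fam file and, for the .bgl.phased file, two leading fields followed by two columns per individual on each row. The function names are hypothetical, and the 5,000 limit is simply the figure quoted above, not a hard rule.

```python
def count_fam_individuals(fam_lines):
    """Count individuals in a PLINK .fam file (one non-blank row each)."""
    return sum(1 for line in fam_lines if line.strip())


def count_phased_individuals(first_line):
    """Count individuals from one row of a Beagle phased file.

    Assumed layout: two leading fields, then two columns per individual.
    """
    fields = first_line.split()
    return (len(fields) - 2) // 2


def exceeds_limit(n_sample, n_reference, limit=5000):
    """True if the combined size is above the figure quoted in the FAQ."""
    return n_sample + n_reference > limit
```

If exceeds_limit returns True, consider passing a larger max_memory value in the running command, for example 10000 by analogy with the 2000 used in the example above.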

6. SNP2HLA still crashes even if I increase memory.

The internal Beagle run can crash if the sample size is too large, even with increased memory. We found that with the T1DGC reference panel (which already contains ~5,000 individuals), the largest sample size Beagle can handle within a reasonable amount of time (1-2 weeks) is about 3,000 - 4,000. This is because the total number of individuals is what matters, and Beagle often cannot handle a total close to 10K. The current solution is to split your dataset into multiple subsets and run SNP2HLA on each separately. If you have multiple cohorts, splitting by cohort is the natural choice.
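
When there is no natural cohort split, one option is to divide the .fam file into fixed-size batches and write one ID list per batch. The Python sketch below is an illustration under assumptions: the function names and the ".batchN.keep" file naming are hypothetical, and the batch size is left to the user.

```python
def split_fam(fam_lines, batch_size):
    """Split .fam rows into lists of (FID, IID) pairs of at most batch_size."""
    ids = [tuple(line.split()[:2]) for line in fam_lines if line.strip()]
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


def write_keep_files(batches, prefix):
    """Write one 'FID IID' list per batch and return the file paths."""
    paths = []
    for number, batch in enumerate(batches, start=1):
        path = f"{prefix}.batch{number}.keep"
        with open(path, "w") as handle:
            for fid, iid in batch:
                handle.write(f"{fid} {iid}\n")
        paths.append(path)
    return paths
```

Each list can then be used to extract a subset with PLINK, for example "plink --noweb --bfile DATA --keep DATA.batch1.keep --make-bed --out DATA.batch1", and SNP2HLA is run on each subset separately.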

7. I have additional questions. Who can I contact?

Please email the SNP2HLA team at snp2hla AT broadinstitute DOT org

8. Do you have any publication describing the method?

SNP2HLA is described in
Xiaoming Jia*, Buhm Han*, Suna Onengut-Gumuscu, Wei-Min Chen, Patrick J. Concannon, Stephen S. Rich, Soumya Raychaudhuri, Paul I.W. de Bakker. "Imputing Amino Acid Polymorphisms in Human Leukocyte Antigens." PLoS One. 8(6):e64683. 2013.
