BUHMBOX manual

<< Go back to BUHMBOX main homepage

Prerequisite

1. PLINK and R installed in the system
2. Disease A cases and controls: plink format.
3. List of disease B associated loci (with risk alleles, risk allele frequencies, and odds ratios)
4. Principal components, if available.

Running steps

  1. Prepare LD-pruned disease B associated loci "SNP File"

    Column 1: RSID, Column 2: risk allele, Column 3: risk allele frequency, Column 4: Odds ratio

    rs2843401   G   0.6718  1.1
    rs2301888   G   0.6515  1.11
    rs2306627   A   0.3005  1.1
    rs883220    C   0.7508  1.11
    rs2476601   A   0.09966 1.8
    

    Note: everything has to be "risk allele", not protective allele! Thus, OR will be always >1.0. -- Update(11/19/15): As of v0.37, this restriction is deprecated; you can put in protective alleles (OR<1). BUHMBOX will automatically flip them internally.
    Note: Column 3 and 4 can be "NA" in which case it will be mean-value-imputed.
    Note: Recommended value for LD pruning is r2=0.1
    If you are using PLINK assoc file, you can use clumping: e.g. --clump ASSOCFILE --clump-r2 0.1 --clump-kb 10000
    If you are using SNP list, you can use pruning: e.g. --extract SNPLIST --indep-pairwise 50 5 0.1 --filter-controls
    Note: For LD pruning, I recommend using only controls for LD calculation, becaues correlations in cases might include the artificial correlations that we want to detect, which was caused by heterogeneity.
    Note: Allele frequency is control allele frequency, but using case or all sample frequency (or reported allele frequency in literature) doesn't affect results much.

  2. Extract disease B associated loci from plink data (dosage RAW format, case/control separately.)

    Given the SNP File from Step 2, we will only need the first two columns: for example in unix, you can

      cut -f1-2 SNPFILE > SNPFILE2
    
    Then let's extract dosage raw format
      plink --noweb --allow-no-sex --bfile YOURPLINKFILE --filter-controls --extract SNPFILE2 --reference-allele SNPFILE2 --recodeA --out CONTROLRAWFILE
    plink --noweb --allow-no-sex --bfile YOURPLINKFILE --filter-cases --extract SNPFILE2 --reference-allele SNPFILE2 --recodeA --out CASERAWFILE
    Of course, you can use --keep INDIVIDUALFILE if you want to use individual ID file to filter out cases/conrols, instead of --filter-controls or --filter-cases.

  3. Run BUHMBOX
        Rscript buhmbox.R SNPFile CASERAWFILE CONTROLRAWFILE YY N Y OUTPUT [PCFILE]
    

    Details of Arguments:

    1 SNPFile The SNP File we generated at Step 1. (four-column format)
    2 CASERAWFILE The plink dosage raw format of cases. Likely has extension of .raw
    3 CONTROLRAWFILE The plink dosage raw format of controls. Likely has extension of .raw
    4 Method mode Each of two letters represents freq mode, and OR mode.
    YY: use both freq and OR; YN: use only freq; NY: use only OR; NN: use neither.
    If freq / OR are not used, default values 0.5 / 1.2 are used for all SNPs. For typical situations, use default "YY". (Other modes are used to evaluate power of method, if some information is ignored.)
    5 Estimation mode "Y": perform estimation -- BUHMBOX estimates proportion of disease-B cases that are misclassified in disease A cases.
    "N": do not perfom estimation. (much faster.)
    Default is "N" for efficiency. Use "Y" if you observe interesting results and do further analysis.
    6 Mendelian randomization "Y": perform Mendelian randomization (risk score approach) -- using the risk alleles, BUHMBOX will calculate genetic risk score of each individual (based on disease B risk alleles) and associates the score to the disease A status, to calculate p-value. (PC accounted, if provided.)
    "N": don't perform Mendelian randomization.
    7 OUTPUT File Output file name
    8 PC File (Optional) Optional PC File. First column: FID, Second column: IID, and other columns: PCs. BUHMBOX compares FID and IID to match individuals. Thus, (1) PC File individuals don't have to be in the same order as in plink file, and (2) PC File may contain extra individuals. (not used anyways)

    Notes:

Output

In output result file:
PVALUE BUHMBOX p-value. If significant, it indicates excessive positive correlations than expected.
LOG10P Log10 of PVALUE
N The number of case individuals that were used in calculation. (After internal QC, which may have removed some individuals due to e.g. missing alleles
ZSCORE BUHMBOX Z-score
MLE.LEFT Maximum likelihood estimation of misclassification rate, between 0-50%
LL.LEFT Log likelihood of MLE.LEFT
MLE.RIGHT Maximum likelihood estimation of misclassification rate, between 50-100%
LL.RIGHT Log likelihood of MLE.RIGHT
MR.PVALUE Mendelian randomization (risk score approach) p-value
MR.LOG10P Log10 of MR.PVALUE
MR.BETA Beta of Mendelian randomization
MR.STDERR Standard error of MR.BETA

Notes:

Also you will find a log file that captures the progress report.

Interpretation

If BUHMBOX P-value is significant, it indicates that the disease-B-loci are excessively positive correlated in disease A cases than expected. This might be due to

If BUHMBOX P-value is not significant (while Mendelian randomization is significant), it might be due to

Frequently Asked Questions

1. What if both pleiotropy and heterogeneity exist? What happens?
2. How many SNPs / how large samples are needed for BUHMBOX?
3. Does BUHMBOX work for polygenic model? (using moderately significant SNPs)

1. What if both pleiotropy and heterogeneity exist? What happens?
BUHMBOX detects heterogeneity regardless of pleiotropy. If both exist, and if BUHMBOX has sufficient power, BUHMBOX will reject the null due to heterogeneity. Note that if BUHMBOX rejects null, it's not evidence of "absence of pleiotropy".
See the following comparison:
BUHMBOX:
No pleiotropy, No heterogeneity: Null
No pleiotropy, Yes heterogeneity: Alternative
Yes pleiotropy, No heterogeneity: Null
Yes pleiotropy, Yes heterogeneity: Alternative

Mendelian randomization:
No pleiotropy, No heterogeneity: Null
No pleiotropy, Yes heterogeneity: Alternative
Yes pleiotropy, No heterogeneity: Alternative
Yes pleiotropy, Yes heterogeneity: Alternative

2. How many SNPs / how large samples are needed for BUHMBOX?
Please refer to the following graphs:(based on risk alelle frequency and odds ratio sampled from GWAS catalog)
The first plot is when #SNPs=50, and the second plot is when #samples=2000. This shows that BUHMBOX has plenty of power in practical situations of current studies for many diseases.

3. Does BUHMBOX work for polygenic model? (using moderately significant SNPs)
By simulations and real data, we found that BUHMBOX can often gain power by using more SNPs that did not meet GWAS threshold (5E-8). For example you can use all SNPs that are < 1E-6. When aggregating correlations into a statistic, BUHMBOX accounts for odds ratio and frequencies. Thus, rarer SNPs or SNPs with small effects will be downweighted. That's why adding more and more moderately significant SNPs does not hurt BUHMBOX much, but often helps.

<< Go back to BUHMBOX main homepage