Toy examples of fitting mJAM

Installation

Please install the package hJAM (Hierarchical Joint Analysis of Marginal Summary Statistics) from github as mJAM is developed as part of the hJAM package. The hJAM package also contains other JAM-family approaches: hJAM (for MR and TWAS) and SHAJAM (for MR and TWAS with high-dimensional intermediates).

devtools::install_github("USCbiostats/hJAM")

Special notes for macOS users

For macOS users with apple silicon (M1 Macs), please install R version 4.2-arm64 from https://mac.r-project.org/ . This is a specific build for compatibility with the M1 arm-based Macs. One of crucial dependencies in mJAM, Rmpfr, is not yet compatible with Intel-based R versions running on M1 Macs and it will cause R/Rstudio to abort. But using version 4.2-arm64 will solve this issue.

library(hJAM)
library(dplyr) ## for the use of piping operator %>%

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Example 1: Complete SNP information

In the first example, we’ll illustrate the usage of mJAM-SuSiE and mJAM-Forward with a simulated dataset, consisting of 3 ancestry groups and 50 SNPs. These 50 SNPs consist of 5 LD blocks, each with 10 SNPs and uniform pairwise LD between them. The first SNP, named “rs1”, is the only true causal SNP.

First, let’s load the marginal summary statistics of these 50 SNPs.

# Load marginal betas data 
MargBeta <- read.table("./Example-Data/Marg_1.txt")
names(MargBeta) <- c("SNP", "MargBeta_P1", "MargSEBeta_P1", 
                      "MargBeta_P2", "MargSEBeta_P2",
                      "MargBeta_P3", "MargSEBeta_P3")
head(MargBeta)

##   SNP MargBeta_P1 MargSEBeta_P1 MargBeta_P2 MargSEBeta_P2 MargBeta_P3
## 1 rs1  0.06054038    0.02489064 0.077964524    0.02068987 0.026311626
## 2 rs2 -0.01068437    0.02126704 0.040433268    0.01813179 0.024127159
## 3 rs3  0.05264446    0.02153163 0.037045359    0.01838805 0.019806117
## 4 rs4 -0.01167104    0.02121476 0.009254768    0.01826285 0.006243936
## 5 rs5  0.03411272    0.02145365 0.032849181    0.01819438 0.024697970
## 6 rs6  0.02948659    0.02152562 0.049799131    0.01836573 0.016761096
##   MargSEBeta_P3
## 1    0.02020833
## 2    0.01781806
## 3    0.01786128
## 4    0.01792291
## 5    0.01774054
## 6    0.01782569

Then load the effect allele frequencies (not necessarily the minor alleles).

# Load effective allele frequency data
EAF <- read.table("./Example-Data/EAF_1.txt")
names(EAF) <- c("SNP", "EAF_P1", "EAF_P2", "EAF_P3")
head(EAF)

##   SNP EAF_P1 EAF_P2 EAF_P3
## 1 rs1 0.2010 0.4072 0.6075
## 2 rs2 0.2374 0.4205 0.5825
## 3 rs3 0.2329 0.4102 0.5935
## 4 rs4 0.2394 0.4151 0.5850
## 5 rs5 0.2386 0.4158 0.5851
## 6 rs6 0.2344 0.4191 0.5895

Finally one reference dosage matrix for each ancestry group.

# Load individual-level SNP data from reference panel
RefDosage_P1 <- read.table("./Example-Data/Dosage_S1_1.txt", header = TRUE)
RefDosage_P2 <- read.table("./Example-Data/Dosage_S2_1.txt", header = TRUE)
RefDosage_P3 <- read.table("./Example-Data/Dosage_S3_1.txt", header = TRUE)
RefDosage_P1[1:5, 1:5]

##   rs1 rs2 rs3 rs4 rs5
## 1   0   0   0   0   0
## 2   0   0   0   0   0
## 3   0   1   0   0   1
## 4   1   1   1   0   1
## 5   1   1   1   0   1

RefDosage_P2[1:5, 1:5]

##   rs1 rs2 rs3 rs4 rs5
## 1   1   2   1   2   2
## 2   0   0   0   0   0
## 3   0   0   0   0   2
## 4   0   0   0   0   0
## 5   0   0   0   0   0

RefDosage_P3[1:5, 1:5]

##   rs1 rs2 rs3 rs4 rs5
## 1   0   1   0   1   0
## 2   1   0   0   1   1
## 3   1   1   1   1   0
## 4   2   2   1   1   2
## 5   1   1   2   0   1

Now we’ll use the marginal summary statistics (beta, se, EAF and reference dosage) to run mJAM-SuSiE and mJAM-SuSiE.

mJAM-Forward

# --- Fit mJAM_Forward 
Ex1_Forward_fit <- mJAM_Forward(N_GWAS = c(5000, 5000, 5000), 
                                X_ref = list(RefDosage_P1,RefDosage_P2,RefDosage_P3), 
                                Marg_Result = MargBeta,
                                EAF_Result = EAF, 
                                condp_cut = 0.05/50, # Bonferroni-corrected p-val
                                within_pop_threshold = 0.50, 
                                across_pop_threshold = 0.20,
                                coverage = 0.95)

## No. 1 selected index SNP: rs1

## 10 SNPs got pruned. 41 SNPs left.

## Continue to next round of index SNP selection.

## Analysis ended.

mJAM_Forward outputs a list with three tables: index, cs and marginal_est.

# Here're the selected index SNP(s)
Ex1_Forward_fit$index

## # A tibble: 1 × 6
##   SNP   b_joint b_joint_var cond_log10p final_log10p  pcut
##   <chr>   <dbl>       <dbl>       <dbl>        <dbl> <dbl>
## 1 rs1    0.0544    0.000156       -4.86        -4.86 0.001

index table lists all the index SNP(s) being selected (SNP), along with their log10(p-value) conditional on all SNP(s) above (cond_log10p), the log10(p-value) conditional on all other index SNP(s) (final_log10p), and the p-value threshold used in this analysis (pcut). b_joint is the conditional effect estimates in the final model, and b_joint_var is the squared standard error of b_joint.

We choose to report log10(p-value) instead of p-value on the original scale because R does not support p-value precision smaller than around 1e-300. To avoid ambiguous results under large-effect regions, we calculate p-values on the log/log10 scales and report these.

# Here're the credible sets SNPs
Ex1_Forward_fit$cs %>% filter(CS_in == T)

## # A tibble: 3 × 7
##   index_SNP CS_SNP Post_Model_Prob Post_Med_Prob SD_Post_CS_Prob CumSum_…¹ CS_in
##   <chr>     <chr>            <dbl>         <dbl>           <dbl>     <dbl> <lgl>
## 1 rs1       rs1             1              1              0.768      0.768 TRUE 
## 2 rs1       rs8             0.207          0.826          0.131      0.899 TRUE 
## 3 rs1       rs10            0.0859         0.850          0.0560     0.955 TRUE 
## # … with abbreviated variable name ¹CumSum_Prob

cs table records the posterior probabilities of all SNPs being considered for credible set SNPs. Post_Model_Prob stands for posterior model probability which quantifies the marginal association between CS_SNP and its index_SNP. Post_Med_Prob stands for posterior mediation probability which quantifies the mediation effect of the CS_SNP through index_SNP. Stronger LD between CS_SNP and index_SNP often leads to larger Post_Med_Prob. By taking the product of Post_Model_Prob and Post_Med_Prob and then standardizing, we have SD_Post_CS_Prob which we use to determine the credible set.

Here we only print out those that are selected in the 95% mJAM credible set (rs1, rs8 and rs10) by filtering this table using CS_in == TRUE.

Ex1_Forward_fit$mJAM_marg_est %>% head()

## # A tibble: 6 × 4
##   SNP   mJAM_effect mJAM_se mJAM_log10p
##   <chr>       <dbl>   <dbl>       <dbl>
## 1 rs1       0.0544   0.0125     -4.87  
## 2 rs2       0.0206   0.0112     -1.18  
## 3 rs3       0.035    0.0113     -2.71  
## 4 rs4       0.00245  0.0112     -0.0826
## 5 rs5       0.0302   0.0112     -2.15  
## 6 rs6       0.0323   0.0112     -2.41

If you’d like to check out the multi-population marginal estimates from mJAM, mJAM_marg_est provides the marginal estimates of all SNPs under the mJAM model. This includes mJAM_effect (the point estimate), mJAM_se (the standard error of the point estimate) and mJAM_log10p (log10 of the p-value based on point estimate and standard error).

mJAM-SuSiE

# --- Fit mJAM_SuSiE
Ex1_SuSiE_fit <- mJAM_SuSiE(marginal.betas = list(MargBeta$MargBeta_P1, MargBeta$MargBeta_P2, MargBeta$MargBeta_P3), 
                      marginal.se = list(MargBeta$MargSEBeta_P1, MargBeta$MargSEBeta_P2, MargBeta$MargSEBeta_P3), 
                      EAFs = list(EAF$EAF_P1, EAF$EAF_P2, EAF$EAF_P3), 
                      N_GWAS = c(5000, 5000, 5000),
                      X_ref = list(RefDosage_P1,RefDosage_P2,RefDosage_P3),
                      SNP_names = MargBeta$SNP,
                      SuSiE_num_comp = 10, 
                      SuSiE_coverage = 0.95)
## --- mJAM_SuSiE results
mJAM_SuSiE_get_cs(susie_fit = Ex1_SuSiE_fit$fit, SNP_names = MargBeta$SNP)

##   index  coverage CS_size index_SNP CS_SNP
## 1    L1 0.9558736       2       rs1    rs1
## 2    L1 0.9558736       2       rs1    rs8

mJAM_SuSiE outputs a table with 5 columns: index, coverage, CS_size,index_SNP and CS_SNP.

index is a label for a distinct credible set.
coverage is the empirical coverage of this credible set.
CS_size shows how many SNPs in total are there in this credible set.
index_SNP is the name of the index SNP (SNP with highest posterior probability) in this credible set.
CS_SNP is names of individual SNPs selected in this credible set.

In this example, mJAM-SuSiE identifies one credible set with rs1 and rs8 in it. Both of them are included in the mJAM-Forward credible set as well.

Example 2: Missing SNP info in some populations

In the second example, we will explore the scenario when we have missing data in marginal summary statistics and in the reference dosage. The dataset used in example 2 is the same as in example 1, except that there are 2 SNPs made missing in either summary statistics or reference dosage: the dosage of rs15 in population 1 is missing; the summary statistic of rs20 in population 3 is missing.

Now let’s load the data:

# Load marginal betas data 
MargBeta <- read.table("./Example-Data/Marg_2.txt", 
                       colClasses = c("character", rep("numeric", 6)))
names(MargBeta) <- c("SNP", "MargBeta_P1", "MargSEBeta_P1", 
                     "MargBeta_P2", "MargSEBeta_P2",
                     "MargBeta_P3", "MargSEBeta_P3")

# Take a look at SNP No.20-25, rs20 in P3 missing
MargBeta[20:25,]

##     SNP  MargBeta_P1 MargSEBeta_P1 MargBeta_P2 MargSEBeta_P2  MargBeta_P3
## 20 rs20  0.001697950    0.02514703  0.01435850    0.02056006           NA
## 21 rs21 -0.010286390    0.02488344 -0.06636717    0.02066592 -0.009864962
## 22 rs22  0.035685926    0.02539317  0.03215446    0.02055493 -0.034835993
## 23 rs23  0.031534739    0.02484755  0.04407369    0.02058289 -0.026641422
## 24 rs24 -0.002939056    0.02526160  0.01915812    0.02059609 -0.008948019
## 25 rs25 -0.005859474    0.02549730 -0.01123426    0.02081805 -0.027068635
##    MargSEBeta_P3
## 20            NA
## 21    0.02024422
## 22    0.02054327
## 23    0.02026291
## 24    0.02053724
## 25    0.02019436

# Load effective allele frequency data
EAF <- read.table("./Example-Data/EAF_2.txt",
                  colClasses = c("character", rep("numeric", 3)))
names(EAF) <- c("SNP", "EAF_P1", "EAF_P2", "EAF_P3")

# Take a look at SNP No.20-25, rs20 in P3 missing
EAF[20:25, ]

##     SNP EAF_P1 EAF_P2 EAF_P3
## 20 rs20 0.1957 0.3927     NA
## 21 rs21 0.2046 0.3954 0.5990
## 22 rs22 0.1999 0.3968 0.5998
## 23 rs23 0.1995 0.4047 0.5998
## 24 rs24 0.1974 0.3973 0.6099
## 25 rs25 0.1979 0.4015 0.5992

Note that rs20’s beta,se, and EAF are missing in population 3.

# Load individual-level SNP data from reference panel
RefDosage_P1 <- read.table("./Example-Data/Dosage_S1_2.txt", header = TRUE)
RefDosage_P2 <- read.table("./Example-Data/Dosage_S2_2.txt", header = TRUE)
RefDosage_P3 <- read.table("./Example-Data/Dosage_S3_2.txt", header = TRUE)

# Notice that rs15 in P1 is missing
length(names(RefDosage_P1)); "rs15" %in% names(RefDosage_P1)

## [1] 49

## [1] FALSE

length(names(RefDosage_P2)); "rs15" %in% names(RefDosage_P2)

## [1] 50

## [1] TRUE

length(names(RefDosage_P3)); "rs15" %in% names(RefDosage_P3)

## [1] 50

## [1] TRUE

Note that rs15 is missing in the reference panel of population 1.

Note on input formatting requirement:

When combining summary data from multiple populations, remember to put NA into where the SNP information is missing in some populations but present in at least one other population. For SNPs that are missing in ALL populations, just exclude them.

# Fill in rs15 as NAs
RefDosage_P1 <- cbind(RefDosage_P1[,1:14], NA, RefDosage_P1[,15:49])
names(RefDosage_P1) <- names(RefDosage_P2)

Now we’ll use the marginal summary statistics (beta, se, EAF and reference dosage) to run mJAM-SuSiE and mJAM-SuSiE.

mJAM-Forward

Ex2_Forward_fit <- mJAM_Forward(N_GWAS = c(5000, 5000, 5000),
                                X_ref = list(RefDosage_P1,RefDosage_P2,RefDosage_P3),
                                Marg_Result = MargBeta,
                                EAF_Result = EAF,
                                condp_cut = 0.05/50, 
                                within_pop_threshold = 0.50,
                                across_pop_threshold = 0.20)

## No. 1 selected index SNP: rs1

## 10 SNPs got pruned. 41 SNPs left.

## Continue to next round of index SNP selection.

## Analysis ended.

## --- mJAM Forward results: index SNPs  
Ex2_Forward_fit$index

## # A tibble: 1 × 6
##   SNP   b_joint b_joint_var cond_log10p final_log10p  pcut
##   <chr>   <dbl>       <dbl>       <dbl>        <dbl> <dbl>
## 1 rs1    0.0544    0.000156       -4.86        -4.86 0.001

## --- mJAM Forward results: credible set SNPs
Ex2_Forward_fit$cs %>% filter(CS_in == T)

## # A tibble: 3 × 7
##   index_SNP CS_SNP Post_Model_Prob Post_Med_Prob SD_Post_CS_Prob CumSum_…¹ CS_in
##   <chr>     <chr>            <dbl>         <dbl>           <dbl>     <dbl> <lgl>
## 1 rs1       rs1             1              1              0.768      0.768 TRUE 
## 2 rs1       rs8             0.207          0.826          0.131      0.899 TRUE 
## 3 rs1       rs10            0.0859         0.850          0.0561     0.955 TRUE 
## # … with abbreviated variable name ¹CumSum_Prob

In this toy example, missing values do not affect mJAM_Forward’s results.

mJAM-SuSiE

# --- Fit mJAM_SuSiE, with missing SNPs. 
Ex2_SuSiE_fit <- mJAM_SuSiE(marginal.betas = list(MargBeta$MargBeta_P1, MargBeta$MargBeta_P2, MargBeta$MargBeta_P3), 
                       marginal.se = list(MargBeta$MargSEBeta_P1, MargBeta$MargSEBeta_P2, MargBeta$MargSEBeta_P3), 
                       EAFs = list(EAF$EAF_P1, EAF$EAF_P2, EAF$EAF_P3), 
                       N_GWAS = c(5000, 5000, 5000),
                       X_ref = list(RefDosage_P1,RefDosage_P2,RefDosage_P3),
                       p_cases = c(0.5, 0.5, 0.5), 
                       SNP_names = MargBeta$SNP,
                       SuSiE_num_comp = 10, 
                       SuSiE_coverage = 0.95)

## --- mJAM_SuSiE results
mJAM_SuSiE_get_cs(susie_fit = Ex2_SuSiE_fit$fit, SNP_names = MargBeta$SNP)

##   index  coverage CS_size index_SNP CS_SNP
## 1    L1 0.9560648       2       rs1    rs1
## 2    L1 0.9560648       2       rs1    rs8

In this toy example, missing values do not affect mJAM_SuSiE’s results.

Example 3: mJAM-Forward with pre-specified index

One practical advantage of mJAM-Forward is that it provides the flexibility to pre-specify your own index SNP by using the option index_snps in mJAM_Forward(). This will bypass the step of index SNP selection but still construct credible sets of the specified index SNPs. In the next example, we will specify two imaginary index SNPs: rs2 and rs11 by adding index_snps = c("rs2", "rs11") to mJAM_Forward().

Note that the order of index_snps matters: if we put index_snps = c("rs2", "rs11"), mJAM_Forward will first construct the credible set for rs2 and then prune out SNPs highly correlated with rs2. Then it will construct the credible set for rs11 and form credible set of rs11 by conditioning on the effect of rs2. If you’d like to get unconditional credible sets for rs2 and rs11 separately, we recommend running mJAM_Forward twice where specifying one index_snp at a time.

Ex3_Forward_fit <- mJAM_Forward(N_GWAS = c(5000, 5000, 5000),
                                X_ref = list(RefDosage_P1,RefDosage_P2,RefDosage_P3),
                                Marg_Result = MargBeta,
                                EAF_Result = EAF,
                                condp_cut = 0.05/50, 
                                index_snps = c("rs2", "rs11"), 
                                within_pop_threshold = 0.50,
                                across_pop_threshold = 0.20)

## No. 1 selected index SNP: rs2

## 2 SNPs got pruned. 49 SNPs left.

## Continue to next round of index SNP selection.

## No. 2 selected index SNP: rs11

## 3 SNPs got pruned. 49 SNPs left.

## Continue to next round of index SNP selection.

## Analysis ended.

## --- mJAM Forward results: index SNPs  
Ex3_Forward_fit$index

## # A tibble: 2 × 6
##   SNP   b_joint[,1] b_joint_var cond_log10p final_log10p  pcut
##   <chr>       <dbl>       <dbl>       <dbl>        <dbl> <dbl>
## 1 rs2        0.0210    0.000125       -1.19        -1.22 0.001
## 2 rs11      -0.0286    0.000125       -1.63        -1.63 0.001

## --- mJAM Forward results: credible set SNPs
Ex3_Forward_fit$cs %>% filter(CS_in == T)

## # A tibble: 15 × 7
##    index_SNP CS_SNP Post_Model_Prob Post_Med_Prob SD_Post_CS_Prob CumSum…¹ CS_in
##    <chr>     <chr>            <dbl>         <dbl>           <dbl>    <dbl> <lgl>
##  1 rs2       rs2              1           1               0.394      0.394 TRUE 
##  2 rs2       rs1              1           0.525           0.207      0.601 TRUE 
##  3 rs2       rs5              1           0.177           0.0697     0.671 TRUE 
##  4 rs2       rs6              1           0.159           0.0625     0.733 TRUE 
##  5 rs2       rs7              0.534       0.297           0.0624     0.795 TRUE 
##  6 rs2       rs3              1           0.137           0.0540     0.849 TRUE 
##  7 rs2       rs9              1           0.0854          0.0337     0.883 TRUE 
##  8 rs2       rs4              0.186       0.417           0.0305     0.914 TRUE 
##  9 rs2       rs10             1           0.0715          0.0282     0.942 TRUE 
## 10 rs2       rs8              1           0.0514          0.0203     0.962 TRUE 
## 11 rs11      rs11             1           1               0.913      0.913 TRUE 
## 12 rs11      rs26             1           0.0147          0.0134     0.927 TRUE 
## 13 rs11      rs9              1           0.0101          0.00925    0.936 TRUE 
## 14 rs11      rs6              0.708       0.0132          0.00853    0.944 TRUE 
## 15 rs11      rs35             1           0.00822         0.00751    0.952 TRUE 
## # … with abbreviated variable name ¹CumSum_Prob

In the above example, mJAM credible set for rs2 consists of all 10 SNPs in the same LD block as rs2. Conditional on rs2, the credible set for rs11 consists of 5 SNPs (all within the same LD block as rs11).

session info

sessionInfo()

## R version 4.2.0 (2022-04-22)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS 13.0.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.1.0       hJAM_1.0.1        fontawesome_0.3.0
## 
## loaded via a namespace (and not attached):
##  [1] shape_1.4.6      tidyselect_1.2.0 xfun_0.32        bslib_0.4.0     
##  [5] reshape2_1.4.4   splines_4.2.0    lattice_0.20-45  colorspace_2.1-0
##  [9] vctrs_0.5.2      generics_0.1.3   htmltools_0.5.3  yaml_2.3.5      
## [13] gmp_0.7-1        utf8_1.2.3       survival_3.3-1   rlang_1.0.6     
## [17] jquerylib_0.1.4  pillar_1.8.1     glue_1.6.2       Rmpfr_0.9-1     
## [21] withr_2.5.0      foreach_1.5.2    lifecycle_1.0.3  plyr_1.8.8      
## [25] stringr_1.5.0    munsell_0.5.0    gtable_0.3.1     codetools_0.2-18
## [29] evaluate_0.16    knitr_1.40       fastmap_1.1.0    fansi_1.0.4     
## [33] Rcpp_1.0.10      scales_1.2.1     susieR_0.10.0    cachem_1.0.6    
## [37] jsonlite_1.8.0   ggplot2_3.4.1    digest_0.6.29    stringi_1.7.12  
## [41] grid_4.2.0       cli_3.6.0        tools_4.2.0      magrittr_2.0.3  
## [45] sass_0.4.2       glmnet_4.1-6     tibble_3.1.8     pkgconfig_2.0.3 
## [49] Matrix_1.4-1     matrixcalc_1.0-6 rmarkdown_2.14   reshape_0.8.9   
## [53] rstudioapi_0.14  iterators_1.0.14 R6_2.5.1         compiler_4.2.0