Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations
A discussion of the paper by Wang et al. 2020
September 6, 2020
This week in Journal Club at the Mathieson Lab we discussed the recently published paper by Wang et al. (2020) on the theoretical aspects of the transferability of polygenic risk scores (PRS) across ancestries.
This work provides three predictors for the relative accuracy of a polygenic risk score (i.e., the accuracy in a test population whose ancestry differs from that of the GWAS that produced the summary statistics, divided by the accuracy in an independent test sample of the same ancestry as the original GWAS). The three predictors differ in their assumptions. All of them assume that the causal variants are the same across populations and that their effect sizes are 100% correlated. Predictor 1 requires knowing what the causal variants are, as well as their degree of linkage disequilibrium (LD) with the genome-wide significant (GWS) SNPs. In practice, that is rarely, if ever, known. Predictor 2 approximates this by using a heuristic to select candidate causal variants for each GWS SNP. Finally, Predictor 3 assumes (rather naively, as the authors state) that the GWS SNPs are the causal SNPs. In summary, the three predictors capture the impact of LD and allele frequencies on PRS performance.
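To make the quantity being predicted concrete, here is a minimal sketch of how an observed relative accuracy could be computed, assuming accuracy is measured as the squared correlation between the PRS and the phenotype. The function names are mine, not the paper's, and the numbers in the usage lines are made up for illustration.

```python
from statistics import mean


def r_squared(scores, phenotypes):
    """Squared Pearson correlation between a PRS and a phenotype."""
    ms, mp = mean(scores), mean(phenotypes)
    cov = sum((s - ms) * (p - mp) for s, p in zip(scores, phenotypes))
    var_s = sum((s - ms) ** 2 for s in scores)
    var_p = sum((p - mp) ** 2 for p in phenotypes)
    return cov ** 2 / (var_s * var_p)


def relative_accuracy(r2_target, r2_source):
    """Accuracy in the target ancestry divided by accuracy in the GWAS ancestry."""
    return r2_target / r2_source


def loss_of_accuracy(ra):
    """Loss of accuracy, defined as 1 - RA."""
    return 1.0 - ra


# Hypothetical accuracies (not taken from the paper):
ra = relative_accuracy(r2_target=0.05, r2_source=0.10)
loa = loss_of_accuracy(ra)
```

A relative accuracy of 1 would mean the score transfers without any loss; the three predictors try to anticipate how far below 1 this ratio falls because of LD and allele frequency differences alone.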
Next, they simulate genotypes based on the UK Biobank (UKBB) data, exploring different heritability values and numbers of causal variants, and therefore different genetic architectures. They used the 1000 Genomes data to impute variants in the UKBB data, which they divided into EUR (European), AFR (African), EAS (East Asian), and SAS (South Asian) groups based on the proximity of each individual to the principal components generated from the 1000 Genomes populations. They assigned effect sizes based on a normal distribution with mean 0 and variance 1 minus the heritability.
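As a rough illustration of what simulating a trait with a given heritability involves, here is a toy sketch. It is a generic recipe, not the authors' pipeline: the genotypes are drawn independently (so there is no LD), the effect sizes are standard normal, and the environmental noise is scaled so that the genetic variance makes up a fraction h2 of the total. All names are hypothetical.

```python
import random
from statistics import pvariance


def simulate_genotypes(n_individuals, allele_freqs, rng):
    """Draw 0/1/2 allele counts per SNP, independently across SNPs (no LD)."""
    return [
        [(rng.random() < p) + (rng.random() < p) for p in allele_freqs]
        for _ in range(n_individuals)
    ]


def simulate_phenotype(genotypes, causal_idx, h2, rng):
    """Return (betas, genetic values, phenotypes) for a target heritability h2."""
    # Assumption: causal effects are drawn from a standard normal distribution.
    betas = {j: rng.gauss(0.0, 1.0) for j in causal_idx}
    genetic = [sum(betas[j] * g[j] for j in causal_idx) for g in genotypes]
    var_g = pvariance(genetic)
    # Choose the noise variance so that var_g / (var_g + var_e) equals h2.
    sd_e = (var_g * (1.0 - h2) / h2) ** 0.5
    phenotypes = [gv + rng.gauss(0.0, sd_e) for gv in genetic]
    return betas, genetic, phenotypes


rng = random.Random(1)
freqs = [0.1, 0.3, 0.5, 0.2, 0.4] * 4          # 20 toy SNPs
genos = simulate_genotypes(5000, freqs, rng)
betas, genetic, pheno = simulate_phenotype(genos, [0, 5, 10, 15], 0.5, rng)
realized_h2 = pvariance(genetic) / pvariance(pheno)
```

The realized heritability in any one simulated sample will only be close to the target, not equal to it, because the noise is random.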
They evaluated the relative accuracies (RA) for Predictors 1, 2 and 3 and compared them to the observed RA in the simulated data. Generally, Predictors 1 and 2 were pretty close to the simulation-based observed RA, while Predictor 3 tended to overestimate it. They also verified that using a different imputation panel, different clumping thresholds, different heritabilities and different numbers of causal variants did not strongly affect RA, which decreases monotonically with genetic distance from Europe, as previously shown.
Next, they used real summary statistics to construct PRSs for 8 traits and tested the performance of the three predictors. Their main findings are that: 1) RA is higher with genetic proximity to Europe and 2) the loss of accuracy (LOA = 1 - RA) attributable to LD and allele frequencies is highest in Africans. That is, for more genetically distant populations, differences in LD and allele frequencies play a more substantial role, while for more closely related populations other factors (not investigated), such as differences in effect sizes, gene-by-environment interactions, etc., presumably have a greater relative importance. The authors say that they provide upper bounds for the proportion of LOA due to LD and allele frequencies, which is useful as new studies are trying to understand and improve the transferability of PRS across ancestries.
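To give some intuition for why allele frequencies alone can reduce accuracy, here is a back-of-the-envelope sketch. It is not one of the paper's predictors: it assumes the score SNPs are causal, have identical effects in both populations, are in Hardy-Weinberg equilibrium and are not in LD with each other, so each SNP contributes beta^2 * 2p(1-p) to the genetic variance. The effect sizes and frequencies are invented.

```python
def expected_variance(betas, freqs):
    """Sum of beta^2 * 2p(1-p) over SNPs (assumes HWE and no LD between SNPs)."""
    return sum(b * b * 2.0 * p * (1.0 - p) for b, p in zip(betas, freqs))


def frequency_only_ratio(betas, freqs_target, freqs_source):
    """Ratio of variance captured in the target vs. source population,
    driven purely by allele frequency differences."""
    return expected_variance(betas, freqs_target) / expected_variance(betas, freqs_source)


# Toy example: the first SNP is common in the source population but rare in the target.
betas = [1.0, 1.0]
ratio = frequency_only_ratio(betas, freqs_target=[0.1, 0.5], freqs_source=[0.5, 0.5])
```

In this toy case the ratio falls below 1 only because the first SNP is rarer in the target population. The paper's predictors additionally account for LD between the causal variants and the SNPs in the score.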
My slides are available here. The preprint is short and technical and does not go through the trouble of explaining the models implemented in LDpred1 and LDpred2, since those have been described in great detail in the original publication. Rather, this seems to be a ‘Bioinformatics’-style preprint that goes straight to the novelties:
LDpred2 runs in R instead of in Python like its predecessor;
LDpred2 has four implementations (LDpred2-inf, LDpred2, LDpred2-sparse, LDpred2-auto), compared to two from its predecessor (LDpred-inf, LDpred);
LDpred2-auto estimates both p (proportion of causal variants) and h2 (SNP heritability) from the data, without requiring a validation set for hyperparameter optimization (this option is great, provided that appropriate QC is performed on the summary statistics);
LDpred2-sparse allows some of the causal variants to actually have effect sizes of exactly zero;
LDpred2 runs both the grid of hyperparameters and the chromosomes in parallel, with the core implemented in C++, which makes it faster;
By changing how the LD radius is defined in the weight estimation (Gibbs sampler) step, it performs much better in regions of long-range LD such as the MHC region. Instead of defining a window of M/3000 SNPs in each direction (where M is the number of SNPs used, which corresponds to roughly 1 Mb), it defines the window in terms of genetic distance, with a default value of 3 cM.
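The difference between the two window definitions in the last point can be sketched as follows. This is only an illustration with hypothetical function names, not code from LDpred or LDpred2, and it assumes the 3 cM is a maximum distance from the focal SNP and that the SNPs are sorted by genetic position.

```python
from bisect import bisect_left, bisect_right


def window_by_count(n_snps, i, radius):
    """Indices of SNPs within `radius` SNPs on each side of SNP i."""
    return range(max(0, i - radius), min(n_snps, i + radius + 1))


def window_by_genetic_distance(cm_positions, i, max_cm=3.0):
    """Indices of SNPs within `max_cm` centimorgans of SNP i.

    `cm_positions` must be sorted in ascending order.
    """
    lo = bisect_left(cm_positions, cm_positions[i] - max_cm)
    hi = bisect_right(cm_positions, cm_positions[i] + max_cm)
    return range(lo, hi)


# Toy map: SNPs 2 to 5 sit in a low-recombination cluster (tiny cM gaps).
cm = [0.0, 0.5, 1.0, 1.01, 1.02, 1.03, 5.0, 9.0]
fixed = list(window_by_count(len(cm), 3, radius=2))
genetic = list(window_by_genetic_distance(cm, 3, max_cm=3.0))
```

With a fixed SNP count, the window covers the same number of neighbours regardless of how little recombination separates them. With a genetic-distance window, regions with little recombination (where LD extends over long physical distances) automatically get more SNPs included.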
Personally, I have spent quite some time figuring out the error messages that LDpred provides, as well as studying the models that it implements. Also, I am a big R fan, so it would be a shame to see all my efforts with LDpred go to waste: I am definitely switching to LDpred2, which runs 100% in R. The first author has kindly provided a tutorial, which I see as a big bonus.
TL;DR: If you are already using LDpred1 (i.e., you have spent time figuring out how it runs, as well as the underlying models), it is definitely worth switching to LDpred2. If you are just beginning your exploration of polygenic risk scores, a less overwhelming approach would be to start with C+T (clumping and thresholding), which is very simple, fast, effective, and requires nothing too fancy. If you are already using another fancy approach such as lassosum, LDpred2 does not seem to do better than that, so I would stick with what is working for you.
I will write something about my experience running LDpred2 in the coming weeks.