What is a proper software to do GWAS analysis of tuberculosis VCFs and phenotype data?

I need software that accepts a VCF file and phenotype data as input, runs a genome-wide association analysis, and generates a report.

Frontiers in Genetics


    Author summary

    mRNA molecules that encode proteins end with a long stretch of adenosines, called the poly(A) tail. The poly(A) tail contributes to the stability of mRNA molecules, their translation into proteins and their export from the nucleus to the cytoplasm. The process of adding this tail to mRNAs is called polyadenylation, and the termination site on the mRNA at which the poly(A) tail is added is called the poly(A) site. In recent years it has become evident that the vast majority of mRNAs of human genes contain several alternative poly(A) sites, and that their usage generates different mRNA isoforms that differ in stability and translation efficiency. Therefore, alternative polyadenylation (APA) is emerging as a novel and important, yet underexplored, mechanism that regulates gene expression. The choice between alternative poly(A) sites in an mRNA molecule is controlled by regulatory sequences located within a region of the mRNA called the 3’ untranslated region (3’UTR). A major challenge in present human genetics research is to understand how common genetic variants affect individuals’ health. In our study, we systematically identified dozens of genetic variants that affect the choice between alternative poly(A) sites and demonstrated that, in doing so, these variants influence the expression level of the target genes. Our results help to illuminate a novel mechanism by which genetic variants that are common in the population affect different traits, including our risk of developing diseases.

    Citation: Shulman ED, Elkon R (2020) Systematic identification of functional SNPs interrupting 3’UTR polyadenylation signals. PLoS Genet 16(8): e1008977.

    Editor: Andreas Gruber, UNITED KINGDOM

    Received: September 24, 2019 Accepted: July 1, 2020 Published: August 17, 2020

    Copyright: © 2020 Shulman, Elkon. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

    Data Availability: All relevant data are within the manuscript and its Supporting Information files.

    Funding: R.E. Israel Science Foundation grant no. 2118/19 DIP German–Israeli Project cooperation (DFG RE 4193/1-1), Bernard Jacobson’s fund - TAU, ED.S Edmond J. Safra Center for Bioinformatics at Tel Aviv University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    Competing interests: The authors have declared that no competing interests exist.


    Population demographic history

    The number of individuals culled annually, which was used as a proxy of population abundance, showed that there had been three population crashes over the last 20 years (Fig. 1). A total of 1186 animals were culled, 755 of them (63.7%) in the last 10 years (Table S1). With regard to TB prevalence, an increasing trend was observed over time (Fig. 1). In the first four seasons (2002/03 to 2005/06), an average of 45% of the individuals analyzed had TB, which further increased to 83% in the last three seasons (2009/10 to 2011/12). The wild boar population studied represented a uniform genetic cluster and evidenced a lack of hybridization with commercial/domestic pig breeds and northern European wild boar populations in both PCA (Fig. S1) and STRUCTURE (Fig. S2) analyses. Furthermore, there was no evidence of population substructure within the sampled wild boar population, both when comparing infected/uninfected individuals (FST = 0.00) and individuals from different time-periods (FST = 0.00) (Fig. S3). Genome inflation factor calculations also revealed an absence of population substructure within our sampled population (Fig. S4). The historical perspective of effective population size (Ne), calculated using the SNP data, showed a progressive decline in Ne in the past generations (Fig. 2).

    Plot showing the wild boar population abundance (dashed line) and tuberculosis prevalence (solid line) estimated for each season (number of individuals = #) throughout the monitored program implemented in the reserve and the sampling period, respectively. The three population crashes are also indicated.

    Historical trajectories of effective population size (Ne) of the wild boar population inferred from genomic data for the past generations.

    Genome-wide associations (GWAS), validation test and expression of candidate genes

    Genome-wide associations were conducted on individuals infected vs. uninfected with MTC and on individuals from the 2002/06 vs. 2009/12 time-periods. In each GWAS, we performed a standard case-control analysis and a stratified case-control analysis. In the latter analysis, we clustered the individuals by age class and time-period/TB outcome in order to account for their possible effects on the statistical models. Since none of the SNPs were significant after Bonferroni correction (p-value < 1.69E-06), an empirical cut-off in the p-value distribution was adopted in the GWAS (discovery stage) to select the most highly differentiated SNPs. The chosen threshold (p-value < 1 × 10^−4), which represents the top 0.03% of the lowest p-values obtained, selected the eight most highly differentiated SNPs for further validation (Fig. 3). In the validation analysis with a larger dataset, some of these SNPs revealed statistically significant differences in allelic frequency between animal groups after Bonferroni correction (threshold p-value = 6.25 × 10^−3). In addition, the p-values of the GWAS and validation tests were combined, and the initial conservative p-value of 1.69E-06 was used as the threshold of significance. Finally, some of the genes close to the differentiated SNPs were further investigated in a large dataset using RNA expression. These findings are described in detail in the following sections for each GWAS.
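The two Bonferroni thresholds quoted above can be checked with a few lines of Python. This is only a sketch: the family-wise error rate of 0.05 is an assumption (the text does not state it), and the function names are hypothetical. Under that assumption, 0.05 divided by the eight validated SNPs reproduces the 6.25 × 10^−3 threshold, and the genome-wide threshold of 1.69E-06 implies roughly 29,600 tested SNPs.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def implied_number_of_tests(alpha: float, threshold: float) -> int:
    """Number of tests implied by a reported Bonferroni threshold."""
    return round(alpha / threshold)


ALPHA = 0.05  # assumption: family-wise error rate, not stated in the text

# Validation stage: eight candidate SNPs were re-tested.
validation_threshold = bonferroni_threshold(ALPHA, 8)

# Discovery stage: back out the number of SNPs from the reported threshold.
implied_snps = implied_number_of_tests(ALPHA, 1.69e-6)

print(validation_threshold)
print(implied_snps)
```

The back-calculated SNP count is only as accurate as the rounded threshold reported in the text.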

    Minor allele frequency (MAF) differentiation for the single nucleotide polymorphisms (SNPs) identified in the genome-wide association (GWAS) analyses. MAF differences are shown between (a) time-periods (2002/06 vs. 2009/12) and (b) tuberculosis (TB) outcome (uninfected vs. infected). The location of each SNP on the porcine genome assembly Sus scrofa 10.2 and the closest genes are also represented. The candidate genes selected for mRNA gene expression analyses are indicated in bold type.

    Infected vs. uninfected individuals with MTC

    The three SNPs (rs81423166, rs81388748 and rs80904044) with the most divergent allelic frequencies (lowest p-values) between MTC infected and uninfected individuals were initially selected from the classical GWAS (standard and stratified case-control analyses) (Fig. 4). When these SNPs were validated in a larger dataset, rs81388748 was the only SNP with a p-value below the threshold of significance (p-value < 6.25 × 10^−3). When the p-values of the GWAS and validation test were combined, rs81423166 was the only SNP that showed a p-value < 1.69E-06. The variant (A) of this SNP was associated with lower odds of having TB (OR = 0.235–0.230, combined results for the standard and stratified analyses, respectively) (Table 1). This SNP, located on chromosome 10 of the porcine genome assembly 10.2, is flanked by various genes, including the BDNF/NT-3 growth factor receptor (BDNF/NT-3) and the neurotrophic tyrosine kinase receptor, type 2 (NTRK2) (Fig. 3). mRNA expression analyses of these genes revealed statistically significant differences in gene expression between time-periods (Table 2), although no significant associations were found between SNP variants and gene expression (Table S2). For the rs81388748 SNP, although no significant result was found in the combined tests, the variant (A) was associated with higher odds of having TB (OR = 5.116–5.189) (Table 1). Among the genes closest to this SNP, only one had a known biological function, the immunoglobulin superfamily member 21 (IGSF21) (Fig. 3). The expression of the IGSF21 gene was higher during 2002/06 (period of lower TB prevalence) than in 2009/12 (Table 2). In addition, the variant (A) had significantly lower gene expression (mean = 0.270, 95% CI = 0.172–0.368) than the variant (C) (mean = 0.392, 95% CI = 0.305–0.479) (Table S2). Furthermore, a detailed expression analysis of the three previously described genes (IGSF21, BDNF/NT-3, NTRK2) was performed considering the age class and time-period/TB outcome. This analysis revealed different gene expression patterns (Fig. S5 and Tables S3, S4 and S5); in particular, for BDNF/NT-3, significant differences were observed for juvenile/adult and infected/uninfected individuals between the time-periods (up-regulated in 2002/06).

    Manhattan plot displaying the genome-wide results [−log10(P)] of the standard (a) and stratified (b) association analyses between uninfected and infected individuals with Mycobacterium tuberculosis complex (MTC).

    2002/06 vs. 2009/12 time-periods

    The five SNPs (rs81465339, rs81455206, rs81333725, rs81394585 and rs80966661) with the most divergent allelic frequencies (lowest p-values) between time-periods (2002/06 vs. 2009/12) were initially selected from the standard and stratified case-control analyses (Fig. 6). When these SNPs were validated in a larger dataset, the rs81465339, rs81394585 and rs80966661 SNPs displayed a p-value < 6.25 × 10^−3 (Table 1). When the p-values of the GWAS and validation test were combined, the rs81465339 and rs81394585 SNPs had a p-value below the threshold of significance (p < 1.69E-06). The rs81465339 SNP had the highest allele frequency difference, with the variant (A) being associated with lower odds (OR = 0.123–0.128) of belonging to 2002/06, the period with the lowest TB prevalence (Table 1). This SNP is closely flanked by the LOC102164072 gene, for which there is no information about biological function (Fig. 5). Likewise, the variant (A) of the rs81394585 SNP, located near CDH8, was associated with lower odds of belonging to 2002/06 (OR = 0.168–0.170) (Table 1). Finally, although no significant result was found in the combined tests, the variant (A) of the rs80966661 SNP, which is located within the ATP9A gene and near the NFATC2 gene, was associated with lower odds of belonging to 2002/06 (OR = 0.081–0.107). The mRNA expression levels varied significantly between time-periods for the LOC102164072 and ATP9A genes (Table 2). LOC102164072 had higher levels of mRNA (up-regulated) during 2002/06 (period of lower TB prevalence) when compared with 2009/12, while the ATP9A gene had the reverse pattern (down-regulated in 2002/06 in comparison with 2009/12). Detailed analysis of gene expression by age class and time-period/TB outcome revealed different patterns (Fig. S5 and Tables S3, S4 and S5). While the LOC102164072 gene showed significant differences between time-periods (up-regulated in 2002/06) for adults, the expression levels of ATP9A varied only for MTC infected adults (down-regulated in 2002/06). Although no significant results were observed for the rs81333725 SNP in the validation and combined tests, the expression levels of the closest gene, RXFP1, were higher during 2002/06 (period of lower TB prevalence) than during 2009/12 (Table 2). The variant (C) of this SNP was associated with lower odds of belonging to 2002/06 (OR = 0.11, 95% CI: 0.03–0.45). Indeed, the variant (C) was significantly associated with a higher level of gene expression (mean = 1.057, 95% CI = 0.792–1.322) when compared with the variant (A) (mean = 0.518, 95% CI = 0.349–0.687) (Table S2).

    Biological function of genes associated to SNPs with the highest allele frequency differences in the standard and/or stratified genome-wide analyses (GWAS).

    Manhattan plot displaying the genome-wide results [−log10(P)] of the standard (a) and stratified (b) association analyses between the 2002/06 and 2009/12 time-periods.


    Accurately predicting phenotypes from genotypes is a problem of high significance for biology that comes with great challenges for learning algorithms. Difficulties arise when learning from high dimensional genomic data with sample sizes that are minute in comparison 23 . Furthermore, the ability of experts to understand the resulting models is paramount and is not possible with most state-of-the-art algorithms. This study has shown that the CART and SCM rule-based learning algorithms can meet these challenges and successfully learn highly accurate and interpretable genotype-to-phenotype models.

    Notably, accurate genotype-to-phenotype models were obtained for 107 antimicrobial resistance phenotypes, spanning 12 human pathogens and 56 antimicrobial agents, which is an unprecedented scale for a machine learning analysis of this problem 19 . The obtained models were shown to be highly interpretable and to rely on confirmed drug resistance mechanisms, which were recovered by the algorithms without any prior knowledge of the genome. In addition, the models highlight previously unreported mechanisms, which remain to be investigated. Hence, the learned models are provided as Additional data with the hope that they will seed new research in understanding and diagnosing AMR phenotypes. A tutorial explaining how to visualize and annotate the models is also included.

    Furthermore, a theoretical analysis of the CART and SCM algorithms, based on sample compression theory, revealed strong guarantees on the accuracy of the obtained models. Such guarantees are essential if models are to be applied in diagnosis or prognosis 23 . To date, these algorithms are among those that perform the highest degree of sample compression and thus, they currently provide the strongest guarantees (in terms of sample compression risk bounds) for applications to high dimensional genomic data. Moreover, it was shown that these guarantees can be used for model selection, leading to significantly reduced learning times and models with increased interpretability. This serves as a good example of how theoretical machine learning research can be transferred to practical applications of high significance.
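To make the idea of a sparse, rule-based classifier concrete, the following is a minimal sketch of a greedy set covering machine (SCM) that learns a conjunction of presence/absence rules. It is a simplified illustration and not the authors' optimized implementation: the binary feature encoding, the toy data, the trade-off parameter `p`, and all function names are assumptions introduced here.

```python
from typing import List, Sequence, Tuple

Rule = Tuple[int, bool]  # (feature index, value the feature must take)


def rule_holds(rule: Rule, x: Sequence[int]) -> bool:
    """True if sample x satisfies the presence/absence rule."""
    index, required = rule
    return bool(x[index]) == required


def train_scm_conjunction(X: Sequence[Sequence[int]], y: Sequence[int],
                          p: float = 1.0, max_rules: int = 3) -> List[Rule]:
    """Greedily build a conjunction of rules.

    At each step, pick the rule with the highest utility:
    (remaining negatives it rejects) - p * (remaining positives it rejects).
    """
    n_features = len(X[0])
    candidates = [(j, v) for j in range(n_features) for v in (True, False)]
    negatives = {i for i, label in enumerate(y) if label == 0}
    positives = {i for i, label in enumerate(y) if label == 1}
    model: List[Rule] = []
    while negatives and len(model) < max_rules:
        best_rule, best_utility = None, 0.0
        for rule in candidates:
            rejected_neg = sum(1 for i in negatives if not rule_holds(rule, X[i]))
            rejected_pos = sum(1 for i in positives if not rule_holds(rule, X[i]))
            utility = rejected_neg - p * rejected_pos
            if utility > best_utility:
                best_rule, best_utility = rule, utility
        if best_rule is None:  # no rule improves the model any further
            break
        model.append(best_rule)
        # Keep only the examples that the conjunction still labels positive.
        negatives = {i for i in negatives if rule_holds(best_rule, X[i])}
        positives = {i for i in positives if rule_holds(best_rule, X[i])}
    return model


def predict(model: List[Rule], x: Sequence[int]) -> int:
    """A conjunction predicts 1 only if every rule holds."""
    return int(all(rule_holds(rule, x) for rule in model))


# Toy data (hypothetical): phenotype is 1 iff feature 0 is present
# and feature 2 is absent.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
y = [1, 1, 0, 0, 0, 0]

model = train_scm_conjunction(X, y)
predictions = [predict(model, x) for x in X]
print(model, predictions)
```

The learned model is a short list of rules that can be read directly, which is the interpretability property the discussion emphasizes; the number of rules kept is also what drives sample compression style guarantees.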

    Finally, it is important to mention the generality of the proposed method, which makes no assumption on the species and phenotypes under study, except that the phenotypes must be categorical. The same algorithms could be used to predict phenotypes of tumor cells based on their genotype (e.g., malignant vs. benign, drug resistance), or to make predictions based on metagenomic data. To facilitate further biological applications of this work, an open-source implementation of the method, that does not require prior knowledge of machine learning, is provided with this work, along with comprehensive tutorials (see Methods). The implementation is highly optimized and the algorithms are trained without loading all the genomic data into the computer’s memory.

    Several extensions to this work are envisaged. The algorithms and their performance guarantees could be adapted to other types of representations for genomic variants, such as single nucleotide polymorphisms (SNP) and unitigs 45 . The techniques proposed by Hardt et al. 46 could be used to ensure that the models are not biased towards undesirable covariates, such as population structure 47,48 . This could potentially increase the interpretability of models, by avoiding the inclusion of rules that are associated with biases in the data. In addition, it would be interesting to generalize this work to continuous phenotypes, such as the prediction of minimum inhibitory concentrations in AMR 20 . Furthermore, another extension would be the integration of multiple omic data types to model phenotypes that result from variations at multiple molecular levels 49 . Additionally, this work could serve as a basis for efficient ensemble methods for genotype-to-phenotype prediction, such as random forest classifiers 50 , which could improve the accuracy of the resulting models, but would complexify the interpretation. Last but not least, the rule-based methods presented here assure good generalization if sparse sample-compressed classifiers with small empirical errors can be found. Nevertheless, it is known that good generalization can also be achieved in very high dimensional spaces with other learning strategies, such as achieving a large separating margin 51,52 on a large subset of examples or by using learning algorithms that are algorithmically stable 53 . Although it remains a challenge to obtain interpretable models with these learning approaches, they could eventually be useful to measure the extent to which the rule-based methods are losing predictive power at the expense of interpretability.

    4. Analytical approaches and methods for multi-omic association studies

    Existing studies in human and model organisms have highlighted the complexity of genomic information flow and the interactive networks in biological processes and disease development. The multi-omics approach thus holds promise for further advancing human disease research. However, such enthusiasm can only be translated into scientific discoveries with sound study designs and solid analytical strategies.

    The ideal dataset for such an integrative analysis consists of multi-omics data all collected on the same set of samples. However, this is often not possible because of the cost or because the control samples simply do not have the appropriate tissues to study. Another type of dataset consists of multi-omics data collected on different sets of individuals from different studies. Different research questions can be answered for each type of multi-omic dataset using corresponding statistical approaches.

    4.1 Regression-based joint modeling

    The regression-based approach jointly models multi-omics data, using the framework of mediation analysis. These data are typically collected on the same subjects. Throughout this section, we let Y be the dichotomous disease outcome, G be a SNP or a set of SNPs depending on the specific method, E be the mRNA expression of a gene or a set of genes, and X be all non-genomic covariates (such as clinical or environmental measurements) with the first covariate being 1. In the following, we review four methods in this category.

    Huang, Vanderweele, and Lin (2014) developed a method that integrates SNP and gene expression data, treating gene expression as the mediator in the causal mechanism from SNPs to the disease outcome ( Figure 2 ). They used a logistic regression model

    to characterize the dependence of the disease outcome on a set of SNPs G, the expression E of a gene, and other covariates X. A SNP-expression pair can be defined in two ways. First, one can choose the SNPs mapped to a gene and the expression of the gene. Second, one can choose the eQTL SNPs and the corresponding gene expression based on an eQTL study. The dependence of the gene expression on the set of SNPs and other covariates is formulated through a linear regression model

    The goal is to test the hypothesis

    This null hypothesis can be interpreted within the framework of causal mediation analysis based on the causal diagram in Figure 2 . Define the total effect (TE) of the set of SNPs on the disease outcome as

    in which both probabilities are marginalized over E. The TE of SNPs can be decomposed into the direct effect (DE) and the indirect effect (IE). The DE is the effect of the SNPs on the disease outcome that is not through gene expression, whereas the IE is the effect of the SNPs that is mediated through the gene expression. When the SNPs are associated with the gene expression (i.e., eQTL SNPs αG ≠ 0), the null hypothesis (3) is equivalent to the null hypothesis of DE = 0 and IE = 0 (i.e., no TE of the SNPs). When the SNPs have no effect on the gene expression (i.e., not eQTL SNPs αG = 0), then there is no IE of the SNPs on Y, so that the null hypothesis (3) is not equivalent to testing for no TE, but simply whether there exists a joint effect of the SNPs, the gene expression, and possibly their interactive effect on the disease risk. This causal interpretation is helpful for understanding genetic etiology of diseases as well as for applications in pharmaceutical research (Y. Li, Tesson, Churchill, & Jansen, 2010).
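The displayed models referred to as (1) to (3) and the total effect do not appear in this text. A form consistent with the symbols used in the surrounding paragraphs (βG, βGE, αG) is sketched below. The names β_X, β_E, α_X, the error term ε, the genotype values g_0 and g_1, and the choice of the log-odds scale for the total effect are assumptions introduced for illustration and should be checked against Huang et al. (2014).

```latex
\begin{align}
\operatorname{logit} P(Y = 1 \mid G, E, X)
  &= X^{T}\beta_X + G^{T}\beta_G + E\,\beta_E + E\,G^{T}\beta_{GE}, \tag{1}\\
E &= X^{T}\alpha_X + G^{T}\alpha_G + \varepsilon, \tag{2}\\
H_0 &:\ \beta_G = 0,\quad \beta_E = 0,\quad \beta_{GE} = 0, \tag{3}\\
\mathrm{TE}
  &= \operatorname{logit} P(Y = 1 \mid G = g_1, X)
   - \operatorname{logit} P(Y = 1 \mid G = g_0, X).
\end{align}
```

Written this way, setting α_G = 0 in (2) removes the path from G through E, which matches the statement that there is then no indirect effect of the SNPs on Y.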

    Gene expression is a potential mediator of genetic effects on the disease outcome.

    As the number of SNPs in G may be large and some SNPs may be highly correlated with each other due to linkage disequilibrium (LD), the standard likelihood ratio test (LRT) or multivariate Wald test for the null hypothesis (3) would use a large number of degrees of freedom and would thus have limited power. To overcome this problem, Huang et al. (2014) proposed a variance component test. They assumed that the components in the vector βG are independent and follow an arbitrary distribution with mean 0 and variance τG, and that the components in βGE are independent and follow an arbitrary distribution with mean 0 and variance τGE. The disease model (1) hence becomes a logistic mixed-effect model, and the test of hypothesis (3) becomes a joint test of variance components and a scalar regression coefficient:
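A form of this hypothesis consistent with the definitions just given is shown below. It is a reconstruction from the description (two variance components plus the scalar coefficient of E, written here as β_E, which is an assumed name), not a quotation from the paper.

```latex
\[
H_0:\ \tau_G = 0,\quad \tau_{GE} = 0,\quad \beta_E = 0 .
\]
```

The test therefore involves three scalar parameters regardless of how many SNPs are in G.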

    Therefore, the proposed variance component test is insensitive to the number of SNPs in G. As the true disease model is unknown and can be different from (1), e.g., without the interaction term, Huang et al. (2014) further proposed an omnibus test that accommodates different possible disease models.

    Later, Huang (2015) extended the work of Huang et al. (2014) to jointly analyze SNP, DNA methylation, and gene expression data with respect to a disease outcome, adding the layer of DNA methylation data to the existing framework. In addition, the earlier work only focused on testing the overall effect of a set of SNPs and the expression of a gene, without distinguishing between the DE of the SNPs on the disease and the IE of the SNPs mediated by the expression. In the later work, Huang (2015) studied path-specific effects, as depicted in the causal diagram ( Figure 3 ), by jointly modeling a set of SNPs within a gene, the DNA methylation and expression of the gene, and the disease outcome as a biological process. Let M denote the DNA methylation measurement at a CpG site. Then the logistic model in (1) is expanded as

    The dependence of the DNA methylation on the set of SNPs and other covariates and the dependence of the gene expression on the SNPs, DNA methylation, and other covariates are specified in the linear regression models

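    where the error terms follow $\varepsilon_M \sim F_M(0, \sigma_M^2)$ and $\varepsilon_{E|M} \sim F_{E|M}(0, \sigma_{E|M}^2)$, and $F_M$ and $F_{E|M}$ are arbitrary distributions.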

    The three path-specific effects are: 1) the direct effect of the SNPs on the outcome (dashed red line), 2) the indirect effect of the SNPs mediated through gene expression but not through methylation (dotted blue lines), and 3) the indirect effect of the SNPs mediated through methylation (solid black lines).

    An arbitrary set of regression coefficients in model (4) can be tested. For example,

    can be assessed by a variance component test as proposed in Huang et al. (2014). To provide a mechanistic interpretation of the hypothesis (7), Huang (2015) first decomposed the overall genetic effect into three path-specific effects: 1) the DE of the SNPs on the outcome, not through the DNA methylation or the expression (denoted by ΔGY), 2) the IE of the SNPs on the outcome that is mediated through the gene expression but not through the DNA methylation (ΔGEY), and 3) another IE of the SNPs on the outcome that is mediated through the DNA methylation (ΔGMY). Within the causal mediation framework, the correspondence between a path-specific effect and a set of regression coefficients in the disease model (4) can be established. For example, the DE ΔGY corresponds to βG, βGM, βGE, and βGME, and this correspondence is not influenced by the relationship among G, M, and E. By contrast, the IE ΔGEY of the SNPs mediated through expression is affected by the G-M-E relationship. Evidently, if there is no effect of G on E, ΔGEY is zero. If there is an effect of G on E, ΔGEY corresponds to βE, βME, βGE, and βGME, which means that the test of the hypothesis (7) is equivalent to the test of the IE of the SNPs mediated through gene expression. To determine the relationship among G, M, and E, one can rely on prior knowledge from existing biological evidence, on statistical analyses that estimate the relationship, or on model selection criteria such as the Akaike information criterion (AIC) (Akaike, 1974) and the Bayesian information criterion (BIC) (Schwarz, 1978).
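The displayed models (4) to (7) are not reproduced in this text. The sketch below is one form consistent with the coefficients named in the paragraph above (βG, βGM, βGE, βGME, βE, βME) and with the later statement that μM = G^T δG + X^T δX. The names β_X, β_M, α_G, α_M, α_X and the exact form of (6) are assumptions introduced for illustration, not taken from Huang (2015).

```latex
\begin{align}
\operatorname{logit} P(Y = 1 \mid G, M, E, X)
  &= X^{T}\beta_X + G^{T}\beta_G + M\,\beta_M + E\,\beta_E \notag\\
  &\quad + M\,G^{T}\beta_{GM} + E\,G^{T}\beta_{GE}
     + M E\,\beta_{ME} + M E\,G^{T}\beta_{GME}, \tag{4}\\
M &= G^{T}\delta_G + X^{T}\delta_X + \varepsilon_M, \tag{5}\\
E &= G^{T}\alpha_G + M\,\alpha_M + X^{T}\alpha_X + \varepsilon_{E|M}, \tag{6}\\
H_0 &:\ \beta_E = 0,\quad \beta_{ME} = 0,\quad \beta_{GE} = 0,\quad \beta_{GME} = 0. \tag{7}
\end{align}
```

Under this reading, hypothesis (7) sets to zero every term of (4) that involves E, which is why it corresponds to the indirect effect through expression when G affects E.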

    To apply this method to the genome-wide data, it is unclear how to select the DNA methylation measurement for a gene. It is possible to consider each of the CpG sites that map to the gene including the upstream and downstream of the gene, but this strategy will result in too many tests. The data application of Huang (2015) does not illustrate this point. Instead, the application concerns 12 methylation loci, a micro-RNA expression, and a gene expression, substituting a set of methylation loci for the set of SNPs in the methodology and substituting a micro-RNA expression for the DNA methylation.

    While Huang et al. (2014) and Huang (2015) jointly analyze multi-omics data from the same subjects, Huang (2014) extended the methodologies to analyze data from different subjects. This is motivated by the fact that the GWAS and QTL studies are likely to be conducted in different subjects due to the availability of tissue samples and the tissue specificity of expression and DNA methylation. Specifically, in GWAS, SNPs and the disease outcomes are collected, but not gene expression or methylation; in QTL studies, SNPs, gene expression and methylation are collected, but not the disease outcome. Define μM = E(M | X, G), μE = E(E | X, G) and μME = E(ME | X, G). From expression (5), we have μM = G^T δG + X^T δX. The quantities μE and μME can be obtained by marginalizing (6) over M. With different omics data on different subjects, the only testable effect is the overall SNP effect on the disease outcome, not any of the path-specific effects. In the statistic of the corresponding variance component test developed in Huang (2015), the M, E and ME terms should be replaced by the estimated μM, μE and μME, respectively. Thus, the testing procedure in Huang (2015) can be applied in settings where methylation and/or expression data are not collected in the subjects of the GWAS but their associations with SNPs (i.e., μM, μE and μME) can be consistently estimated from external meQTL and eQTL studies. Note that the meQTL and eQTL studies should be conducted on the same subjects in order to calculate μME.

    Zhao et al. (2014) considered the same omic datasets that Huang et al. (2014) have dealt with, i.e., SNP, gene expression, and disease data collected on the same set of subjects. However, Zhao et al. (2014) focused on testing the IE of the SNPs on the disease outcome that is regulated by gene expression. They proposed the following two-stage model for each SNP G,

    where E may include the expression for a set of genes. Model (9) differs substantially from model (2) in that the former does not consider the regulation of the SNP on the expression of an individual gene, but on one particular linear combination of the expression levels; it hence requires estimating fewer parameters. Note that this is the same linear combination of gene expression as in the disease model (8). Based on the two-stage model, one can test for SNP-disease association by testing H0: αG = 0, assuming that the SNP affects disease risk through affecting the gene expression levels. This work is analogous to the work by Huang et al. (2014) but focuses solely on increasing the power of SNP association testing, rather than on assigning causal interpretations to any of the parameters. When a particular set of genes or a pathway is of interest such that the number of genes in E does not exceed the number of subjects, Zhao et al. (2014) proposed to use the standard estimating equation theory for inference. To apply their method in an agnostic, genome-wide manner, they proposed to consider one gene in E at a time; to reduce the multiple testing burden imposed by the huge number of pairwise tests, they proposed to restrict testing to those SNPs located cis to each gene. This method works best when there is no DE of the SNPs on the disease outcome, such that the SNPs act only through regulating gene expression. In this case, the gene expression can help explain the variability of the SNP effect on disease and thus increases the power of detecting the overall effect of SNPs on disease. Indeed, Kenny and Judd (2014) noted that in the absence of a DE, testing the IE in a mediation analysis can be dramatically more powerful than the standard method of testing SNP-disease associations directly. Even in the presence of a DE, so that model (8) mis-specifies the true disease risk, Zhao et al. (2014) showed, both analytically and numerically, that their method is still more powerful than the standard method when the magnitude of the DE is lower than the magnitude of the IE.
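The two-stage model referred to as (8) and (9) is not displayed in this text. One form that matches the description (a disease model driven by a linear combination of expression levels, and a second stage in which the SNP regulates that same linear combination) is sketched below. The coefficient names β_X, β_E, α_X and the error term ε are assumptions made for illustration and may differ from the notation of Zhao et al. (2014).

```latex
\begin{align}
\operatorname{logit} P(Y = 1 \mid E, X) &= X^{T}\beta_X + E^{T}\beta_E, \tag{8}\\
E^{T}\beta_E &= G\,\alpha_G + X^{T}\alpha_X + \varepsilon. \tag{9}
\end{align}
```

In this form the SNP enters the disease model only through E^T β_E, so H0: αG = 0 is a test of the indirect effect, and a single scalar αG is estimated per SNP however many genes E contains.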

    4.2 Matching Patterns of eQTL and GWAS

    He et al. (2013) developed a method to detect disease-associated genes (i.e., genes whose expression level influences the disease risk) by matching the eQTL patterns of each gene with the patterns of disease-associated SNPs. This method is especially useful when eQTL and GWAS studies were conducted on different samples. The rationale is that, for a disease-associated gene, any genetic variation that perturbs its expression is also likely to influence the disease risk ( Figure 4 ). Thus, the eQTLs of the gene, which constitute a unique “genetic signature” of this gene, should overlap significantly with the set of loci associated with the disease. Because many eQTLs act in trans, this approach can identify important genes that are distal to any GWAS association signals and thus impossible to be detected with GWAS alone.

    Ui: binary indicator variables to represent the true SNP-gene expression causal relationship, Vi: binary indicator variables for the true SNP-disease relationship. Z is a binary variable indicating whether the gene expression trait influences the disease.

    He et al. (2013) implemented the above idea of genetic signature matching in a Bayesian framework. Suppose that, for a given gene, there are m putative eQTLs that pass a lenient significance threshold in the eQTL study. Let Uj and Vj be binary indicators of whether the jth SNP is associated with the expression and with the disease outcome, respectively. Let Z be a binary indicator of whether the expression of the gene is associated with the disease. If Uj = 1 and Vj = 1 for a substantial number of SNPs, then it is likely that Z = 1. The available data consist of the p-values of the SNPs for the gene expression from an eQTL study, denoted by the vector peQTL, and the p-values of the SNPs for the disease outcome from a GWAS, denoted by the vector pGWAS. Although Uj and Vj are not observed, they are related to peQTL,j and pGWAS,j: when peQTL,j (pGWAS,j) is small, it is likely that Uj (Vj) = 1. Thus, the data peQTL and pGWAS can be used to test the null hypothesis H0: Z = 0, that the gene is not associated with the disease, against H1: Z = 1. The inference of Z is based on the Bayes factor (BF):

    $$\mathrm{BF} = \frac{P(p_{\mathrm{eQTL}},\, p_{\mathrm{GWAS}} \mid Z = 1)}{P(p_{\mathrm{eQTL}},\, p_{\mathrm{GWAS}} \mid Z = 0)},$$

    which is the ratio of the probabilities of the data under H1 and H0. When all SNPs are unlinked, the BF of the gene is the product of the BFs of all SNPs:

    $$\mathrm{BF} = \prod_{j=1}^{m} B_j, \qquad B_j = \frac{P(p_{\mathrm{eQTL},j},\, p_{\mathrm{GWAS},j} \mid Z = 1)}{P(p_{\mathrm{eQTL},j},\, p_{\mathrm{GWAS},j} \mid Z = 0)}.$$

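    When there is LD among SNPs, He et al. (2013) proposed using a block-level BF, which is the mean of the BFs of all SNPs in that block (Servin and Stephens, 2007). The probability P(peQTL,j, pGWAS,j | Z) is computed by summing over the hidden variables Uj and Vj:

    $$P(p_{\mathrm{eQTL},j},\, p_{\mathrm{GWAS},j} \mid Z) = \sum_{U_j \in \{0,1\}} \sum_{V_j \in \{0,1\}} P(U_j)\, P(V_j \mid U_j, Z)\, P(p_{\mathrm{eQTL},j} \mid U_j)\, P(p_{\mathrm{GWAS},j} \mid V_j).$$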

    The components on the right-hand side are specified as follows. Uj is a Bernoulli variable with success probability α, the prior probability that a SNP is associated with the gene expression. He et al. (2013) chose α = 1.0 × 10⁻³ for cis-eQTLs (within 1 Mb of the gene) and α = 5.0 × 10⁻⁵ for trans-eQTLs. When Z = 0, the gene is irrelevant to the disease and thus Uj and Vj are independent. When Z = 1 and Uj = 0, this SNP is not an eQTL and thus Uj and Vj are also independent. In both cases, Vj is a Bernoulli variable with success probability β, the prior probability that a SNP is associated with the disease. He et al. (2013) chose β = 1.0 × 10⁻³. When Z = 1 and Uj = 1, Vj should always be 1, as a true eQTL of the gene is expected to be associated with the disease. The probabilities P(peQTL,j | Uj) and P(pGWAS,j | Vj) reflect the distributions of p-values under the null or alternative hypothesis. Let TeQTL,j and TGWAS,j be the test statistics corresponding to peQTL,j and pGWAS,j, respectively. Under the null, TeQTL,j (given Uj = 0) and TGWAS,j (given Vj = 0) follow the standard normal distribution. Under the alternative, P(TeQTL,j | Uj = 1) and P(TGWAS,j | Vj = 1) depend on the tests from which the test statistics are derived and on the effect size of the SNP. Finally, the BF of the jth SNP, Bj, can be expressed as

    $$B_j = \frac{(1-\alpha)\,(1-\beta+\beta B_{\mathrm{GWAS},j}) + \alpha\, B_{\mathrm{eQTL},j}\, B_{\mathrm{GWAS},j}}{(1-\alpha+\alpha B_{\mathrm{eQTL},j})\,(1-\beta+\beta B_{\mathrm{GWAS},j})},$$

    where

    $$B_{\mathrm{eQTL},j} = \frac{P(p_{\mathrm{eQTL},j} \mid U_j = 1)}{P(p_{\mathrm{eQTL},j} \mid U_j = 0)}, \qquad B_{\mathrm{GWAS},j} = \frac{P(p_{\mathrm{GWAS},j} \mid V_j = 1)}{P(p_{\mathrm{GWAS},j} \mid V_j = 0)}$$

    are BFs measuring the association of the jth SNP with the expression and the disease, respectively. Thus the BF of the gene being tested depends only on α, β, and the SNP-level BFs. (If Bayesian inference has been performed in both the eQTL and GWAS analyses, it is straightforward to combine the resulting BFs to obtain the BF for the gene.) To assess the statistical significance of the BF, a simulation approach was proposed to compute the p-value of the BF for a gene.
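    The way SNP-level BFs combine into a gene-level BF can be sketched in a few lines of Python. This is only an illustration of the model as described above (independent Uj and Vj unless Z = 1 and Uj = 1, in which case Vj = 1), not the Sherlock implementation; the function names and the example BF values are hypothetical, and LD blocks and the simulation-based p-value are not handled.

```python
import math


def snp_bayes_factor(b_eqtl, b_gwas, alpha, beta):
    """BF of one SNP for Z = 1 versus Z = 0, given its eQTL and GWAS BFs."""
    # GWAS evidence with V_j summed out, used whenever V_j is independent of U_j.
    gwas_mix = (1.0 - beta) + beta * b_gwas
    # Z = 1: a non-eQTL SNP (U_j = 0) behaves as under the null model,
    # while a true eQTL (U_j = 1) forces V_j = 1.
    numerator = (1.0 - alpha) * gwas_mix + alpha * b_eqtl * b_gwas
    # Z = 0: U_j and V_j are independent.
    denominator = ((1.0 - alpha) + alpha * b_eqtl) * gwas_mix
    return numerator / denominator


def gene_log_bayes_factor(snps, alpha, beta):
    """Log BF of a gene as the sum of log SNP BFs (assumes unlinked SNPs)."""
    return sum(math.log(snp_bayes_factor(be, bg, alpha, beta)) for be, bg in snps)


# Hypothetical (B_eQTL, B_GWAS) pairs for three cis SNPs of one gene.
example_snps = [(5000.0, 200.0), (3000.0, 0.5), (1.0, 1.0)]
log_bf = gene_log_bayes_factor(example_snps, alpha=1e-3, beta=1e-3)
print(f"log BF of the gene: {log_bf:.3f}")
```

    In this sketch a strong eQTL that also shows GWAS evidence raises the gene BF, a strong eQTL with GWAS evidence against association lowers it, and a SNP with a GWAS BF of 1 contributes nothing.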

    Because this method does not directly test the relationships between genotypes, gene expression, and disease outcomes, but only requires p-values, the eQTL and GWAS data do not have to come from the same subjects. This method is also generalizable to molecular traits other than gene expression, such as metabolites, non-coding RNAs, and epigenetic modifications. It has been implemented in a software program called Sherlock. The name implies that the method works as a detective, comparing the fingerprint from a crime scene (the GWAS signature) against a database of fingerprints (the eQTL signature of all genes) to determine the real culprit (disease-associated genes).

    4.3 Aggregating evidence of multi-omics data over gene set/pathway

    Xiong et al. (2012) developed a statistical framework, called Gene Set Association Analysis (GSAA), that aggregates genetic and gene expression evidence, in terms of “association scores”, at the level of gene sets or pathways for genome-wide association analysis. The gene expression data and the SNP genotype data may be collected on the same samples or on different samples. The dashed box in Figure 5 illustrates the three-step aggregation procedure of GSAA without consideration of DNA methylation sites, proteins and metabolites.

    Figure 5. An omnibus test of pathways enriched for trait-associated SNPs, gene expression levels, CpG sites, proteins and metabolomic features. This multi-layer approach allows aggregation of single association signals from individual markers to genes to pathways. The original aggregation model is limited to SNPs and gene expression levels (within the dashed box).

    First, the SNP set association score and the gene expression association score are calculated separately. The gene expression association score, which reflects the degree to which a gene is differentially expressed between cases and controls, is calculated as the difference of the group means scaled by the standard deviation. The SNP set association score is the maximum of the single-SNP scores over all the SNPs mapped to the gene region, where the single-SNP score is the genotype- or allele-based χ² statistic and the gene region is a pre-defined genomic interval encompassing the gene together with the regions upstream and downstream of the transcribed region.

    Second, the SNP set association score and the gene expression association score are combined to generate a gene association score. This step integrates evidence for association across the gene expression and SNP data. Before the integration, the absolute values of the gene expression scores are taken in order to capture both up-regulation and down-regulation in pathways and to be consistent with the magnitude of the SNP set association scores. Both the gene expression score and the SNP set score are also standardized by the mean and standard deviation of their respective null distributions, which are generated by a phenotype-based permutation procedure, so that scores from different statistical tests or on different scales are brought onto a common scale and are thus directly comparable with each other. The gene association score is the sum of the two standardized scores.
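    The first two steps can be sketched as follows. This is a simplified illustration rather than the GSAA software: it assumes that expression and genotypes were measured on the same samples, scales the mean difference by the overall sample standard deviation (the text does not say which standard deviation is used), uses the allele-based χ² statistic, and all function names and data are hypothetical.

```python
import random
import statistics


def expression_score(values, labels):
    """Difference of case and control means scaled by the overall SD (assumption)."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    return (statistics.mean(cases) - statistics.mean(controls)) / statistics.stdev(values)


def allelic_chi2(genotypes, labels):
    """Allele-based 2x2 chi-square statistic; genotypes are minor-allele counts 0/1/2."""
    a = sum(g for g, y in zip(genotypes, labels) if y == 1)      # minor alleles, cases
    b = sum(2 - g for g, y in zip(genotypes, labels) if y == 1)  # major alleles, cases
    c = sum(g for g, y in zip(genotypes, labels) if y == 0)      # minor alleles, controls
    d = sum(2 - g for g, y in zip(genotypes, labels) if y == 0)  # major alleles, controls
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom if denom else 0.0


def standardize(observed, null_values):
    """Center and scale an observed score by its permutation null distribution."""
    sd = statistics.stdev(null_values)
    return (observed - statistics.mean(null_values)) / sd if sd > 0 else 0.0


def gene_association_score(expr, snps, labels, n_perm=200, seed=0):
    """Sum of the standardized |expression score| and the standardized max SNP score."""
    rng = random.Random(seed)

    def raw_scores(y):
        return abs(expression_score(expr, y)), max(allelic_chi2(g, y) for g in snps)

    obs_expr, obs_snp = raw_scores(labels)
    null_expr, null_snp = [], []
    for _ in range(n_perm):
        permuted = list(labels)
        rng.shuffle(permuted)  # phenotype-based permutation
        e, s = raw_scores(permuted)
        null_expr.append(e)
        null_snp.append(s)
    return standardize(obs_expr, null_expr) + standardize(obs_snp, null_snp)


# Hypothetical gene: 10 cases, 10 controls, one expression trait, two SNPs.
labels = [1] * 10 + [0] * 10
expr = [5.0 + 0.1 * i for i in range(10)] + [1.0 + 0.1 * i for i in range(10)]
snps = [[2] * 10 + [0] * 10, [1, 0] * 10]
score = gene_association_score(expr, snps, labels)
```

    Permuting the phenotype labels once and scoring both data types on the same permutation keeps the two null distributions tied to the same shuffles, which is one simple way to realize the phenotype-based permutation described above.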

    Third, the gene set is evaluated by a weighted Kolmogorov-Smirnov (K-S) statistic (i.e., the gene set association score) to determine whether the genes belonging to this gene set are preferentially near the top of the rank-ordered list based on gene association scores. Based on a phenotype-based permutation procedure that preserves the LD structure in the SNP data and the gene-gene correlation structure in the gene expression data, the false discovery rate (FDR) or the family-wise error rate (FWER) can be calculated, and significant gene sets are declared while controlling the FDR or FWER below a certain threshold.
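    A weighted K-S statistic of this kind can be sketched as a running sum over the ranked gene list. The version below follows the GSEA-style enrichment score, which is assumed here to be the form intended; the function name, the weight exponent and the example scores are hypothetical.

```python
def weighted_ks_score(gene_scores, gene_set, weight=1.0):
    """Maximum deviation from zero of a weighted running sum over the ranked genes."""
    ranked = sorted(gene_scores, key=gene_scores.get, reverse=True)
    in_set = [gene in gene_set for gene in ranked]
    n_miss = len(ranked) - sum(in_set)
    hit_total = sum(abs(gene_scores[g]) ** weight for g, hit in zip(ranked, in_set) if hit)
    if n_miss == 0 or hit_total == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene, hit in zip(ranked, in_set):
        if hit:
            # Step up in proportion to the gene association score.
            running += abs(gene_scores[gene]) ** weight / hit_total
        else:
            # Step down uniformly for genes outside the set.
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical gene association scores.
scores = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0, "e": 1.0, "f": 0.5}
top_set_score = weighted_ks_score(scores, {"a", "b"})     # set concentrated at the top
bottom_set_score = weighted_ks_score(scores, {"e", "f"})  # set concentrated at the bottom
```

    Significance would then be judged by recomputing this score on phenotype-permuted data, as described in the text.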

    Although Xiong et al. (2012) focused only on integrating gene expression and SNP genotype data, the flexibility of this framework allows integration of other omic data such as DNA methylation, proteomics, and metabolomics data ( Figure 5 ). Analogous to the SNP set association score, we can first calculate the χ² statistic at single CpG sites based on the beta values (measuring DNA methylation level) and then obtain a CpG-set association score for the gene using a maximum statistic. We can also calculate the χ² statistic for each protein. These statistics are aggregated into the gene association score after proper standardization, along with those for SNP sets and gene expression. Finally, we perform a weighted K-S test for metabolites within each pathway to obtain a metabolite-set association score. The pathway association scores are the sums of the gene- and metabolite-set association scores.



    All thirteen individuals analysed in this study are members of the Cultural Bubi Association of Fuenlabrada, Madrid (Spain). We obtained informed consent from all subjects. We discarded 25 of the interviewed individuals because of admixed ancestry; many of them had a recent Fang ancestor from the mainland. Even though most of the individuals were not born in Bioko, we verified that all grandparents of the selected individuals were born on the island; many of the volunteers’ direct ancestors come from Malabo, Bariobé and Baney, which are located in the northeastern region of Bioko (Additional file 1: Table S1).

    Extraction, sequencing and mapping

    We isolated DNA from cotton swabs using all the available material and an organic-based DNA extraction method adapted to Amicon® Ultra 0.5-mL columns [45]. After extraction, we concentrated the DNA by centrifugation up to 50 μL and subjected samples to a quality control. To ensure there was a proper DNA concentration, 1 μL of sample was loaded in a 1% agarose gel and stained with ethidium bromide. Only a single band was observed. The samples were quantified with BioTek’s Epoch and yielded values, on average, of 68.88 ng/μL.

    Genomic DNA libraries were prepared using the TruSeq DNA PCR-Free Library Preparation Kit (in accordance with the general settings of the preparation guide). The procedure produced a PCR-free library with a 350 bp average insert size and requires 20 ng/μL (in 50 μL samples). DNA samples were randomly fragmented with a Covaris system and sequenced on a HiSeqX10 (Illumina) with hiseq2x150bp settings plus 65 bp paired-end adapters at Macrogen (South Korea).

    We evaluated the quality of the paired-end sequenced reads with FastQC. The sequencing adapters were then removed using AdapterRemoval [46], reads shorter than 30 bp were removed, and the remaining reads were mapped against the human reference genome [National Center for Biotechnology Information (NCBI) 37, hg19] using Burrows-Wheeler Aligner (BWA) with default parameters [47]. Duplicated reads were removed using Picard Tools MarkDuplicates version 2.8.3 and reads with low mapping quality (< 30) were removed with SAMtools version 1.623 [48].

    Genotyping and quality control tests

    Uniquely aligned reads were processed with Base Quality Score Recalibration (BQSR) as implemented in GATK version 3.7 [49]. Although the plots did not show signals of systematic errors, we applied recalibration to all filtered reads. We used GATK HaplotypeCaller in GVCF mode for scalable variant calling (using GRCh37 as the reference sequence). Individual variant calls were merged into a single VCF file using the GATK GenotypeGVCFs tool, and the variants were filtered using Variant Quality Score Recalibration (VQSR) with a filter level of 99%. We used the QD, MQ, ReadPosRankSum, FS, and SOR annotations in this step. We excluded any variant with a depth of coverage below 70% or above 200% of the mean depth. We also removed variants with qualities below 30. Finally, we removed called variants with a minor allele frequency below 0.05 or a Hardy-Weinberg disequilibrium p-value below 1e-6.
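    The hard filters listed at the end of this paragraph amount to a simple per-variant predicate. The sketch below only restates those thresholds; the dictionary keys and the function name are hypothetical, and it is not the pipeline actually used (which applied the filters with dedicated tools after VQSR).

```python
def passes_filters(variant, mean_depth, min_qual=30.0, min_maf=0.05, min_hwe_p=1e-6):
    """Return True if a variant survives the depth, quality, MAF and HWE filters."""
    # Keep depth within 70%-200% of the mean depth of coverage.
    depth_ok = 0.7 * mean_depth <= variant["depth"] <= 2.0 * mean_depth
    # Minor allele frequency from the alternate-allele frequency.
    maf = min(variant["alt_freq"], 1.0 - variant["alt_freq"])
    return (
        depth_ok
        and variant["qual"] >= min_qual
        and maf >= min_maf
        and variant["hwe_p"] >= min_hwe_p
    )


# Hypothetical variants with a mean depth of 30x.
good = {"depth": 30, "qual": 50.0, "alt_freq": 0.20, "hwe_p": 0.5}
too_deep = {"depth": 70, "qual": 50.0, "alt_freq": 0.20, "hwe_p": 0.5}
too_rare = {"depth": 30, "qual": 50.0, "alt_freq": 0.98, "hwe_p": 0.5}
kept = [v for v in (good, too_deep, too_rare) if passes_filters(v, mean_depth=30.0)]
```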

    Population genetics dataset

    Using Plink 1.9 [50], we merged our filtered variants with 690,739 SNPs from 1235 genotyped individuals belonging to 35 Western African populations (Additional file 1: Table S2). This dataset includes Bantu-speaking populations, hunter-gatherers and Western African groups [13]. We excluded triallelic sites, A/T and C/G mutations and all sites with a minor allele frequency (MAF) below 0.05. We subsequently removed positions with > 10% missing data and individuals with > 5% missing values. To ensure that genotypes were properly called after merging the dataset, the Yoruba SNP genotypes were compared against the 1000 Genomes Yoruba population. However, subsequent analyses were performed only with the Yoruba genotypes from the Western Africa dataset [51]. Positions with pairwise Fst values > 0.2 between the two samples were also removed. Based on the colonial history of Bioko, we assessed the presence of potential genetic admixture of the Bubi with Spanish individuals by adding Iberian samples from 1000 Genomes [52] to the SNP dataset. After this procedure, we again removed positions with MAF below 0.05, missing data above 0.1 and Hardy-Weinberg disequilibrium p-values below 1e-10.
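    A/T and C/G sites are usually dropped when merging datasets because each allele is the complement of the other, so a strand flip between datasets cannot be detected from the alleles alone. The site-level checks described above can be sketched as follows; the function names and example values are hypothetical, and in practice these filters would be applied with Plink.

```python
STRAND_AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose alleles are complementary to each other."""
    return frozenset((allele1.upper(), allele2.upper())) in STRAND_AMBIGUOUS


def keep_site(alleles, maf, missing_rate, min_maf=0.05, max_missing=0.10):
    """Keep biallelic, strand-unambiguous sites that pass the MAF and missingness filters."""
    if len(alleles) != 2:  # drops triallelic (and other multiallelic) sites
        return False
    if is_strand_ambiguous(*alleles):
        return False
    return maf >= min_maf and missing_rate <= max_missing


# Hypothetical sites.
print(keep_site(("A", "G"), maf=0.12, missing_rate=0.01))       # kept
print(keep_site(("A", "T"), maf=0.30, missing_rate=0.00))       # strand-ambiguous
print(keep_site(("A", "C", "G"), maf=0.20, missing_rate=0.00))  # triallelic
```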

    For most of the analyses, we extracted a sub-dataset with representative populations from Western and Central Africa. This reduced dataset includes 14 populations and 169 individuals (Additional file 1: Table S3). Some of the population genomic analyses require an outgroup that is unrelated to the tested populations. We therefore merged our genotypes with data from eleven San individuals [53] from the Human Origins array [54], following the same merging procedure previously detailed. The resulting African dataset, including the Bubi, comprises 130,647 SNPs present in 1259 individuals.

    Mitochondrial (mt) DNA and Y-chromosome analysis

    Reads were mapped against the Revised Cambridge Reference Sequence (rCRS) of the human mtDNA [55]. After calling variants with GATK version 3.7 [49] as has been previously described, the mtDNA haplogroups were predicted using Haplogrep version 2 [56]. Y chromosome haplogroups were predicted by classifying the observed mutations according to the PhyloTree database [57].

    Population genomic analyses

    To situate the Bubi within the present diversity of the Gulf of Guinea and Western Africa, a principal components analysis (PCA) of the reduced dataset was performed using EIGENSOFT [58]. Results were plotted using the R package ggplot2 [59, 60].

    ADMIXTURE plots were generated to estimate the proportions of K ancestral components in each individual genome [61] of the reduced dataset. As the analysis assumes that markers are not in linkage disequilibrium (LD), we pruned the dataset: we used Plink 1.9 to remove SNPs with r² > 0.5 in windows of 50 SNPs. ADMIXTURE analyses were performed with K from 2 to 15 and were repeated five times. The ADMIXTURE iterations were consolidated using CLUMPP with the large K greedy algorithm [62] and the results were plotted using the R package pophelper [63].

    The outgroup f3 statistic is a useful test for determining the population closest to a target population: it uses one outgroup population and measures the amount of genetic drift shared between the target and a test population. The San were selected as the outgroup because they represent the most distant African population with genome-wide data, and the Bubi population was compared to all other populations in the dataset. The f3(San; Bubi, Test) statistic was calculated with popstats [64] and the results were again plotted using R. f statistics can also be used to determine which populations share the most genetic drift with the Bubi people. To do so, we used the popstats software to compute the f4 statistics f4(Test, San; Bubi, Mbuti), f4(Test, San; Bubi, Baka), f4(Test, San; Bubi, Yoruba) and f4(Test, San; Bubi, Fang). These combinations allow us to dissect the genetic admixture of the tested populations with the Bubi in relation to all the representative sources of genetic ancestry in Western Africa: Eastern RHG, Western RHG, Western-African populations and Bantu-speaking populations.
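    At their core, both statistics are averages over SNPs of products of allele-frequency differences. The sketch below shows these basic estimators under the assumption that population allele frequencies are already available; it omits the finite-sample bias correction and the block-jackknife standard errors that tools such as popstats provide, and the frequencies shown are made up.

```python
def f3(outgroup, pop_a, pop_b):
    """Outgroup f3(O; A, B): mean of (o - a) * (o - b) over SNPs."""
    products = [(o - a) * (o - b) for o, a, b in zip(outgroup, pop_a, pop_b)]
    return sum(products) / len(products)


def f4(pop_1, pop_2, pop_3, pop_4):
    """f4(P1, P2; P3, P4): mean of (p1 - p2) * (p3 - p4) over SNPs."""
    products = [(p1 - p2) * (p3 - p4) for p1, p2, p3, p4 in zip(pop_1, pop_2, pop_3, pop_4)]
    return sum(products) / len(products)


# Hypothetical allele frequencies at four SNPs.
san = [0.10, 0.90, 0.50, 0.20]
bubi = [0.50, 0.50, 0.70, 0.60]
test = [0.45, 0.55, 0.65, 0.55]
fang = [0.48, 0.52, 0.72, 0.58]

shared_drift = f3(san, bubi, test)   # f3(San; Bubi, Test)
f4_value = f4(test, san, bubi, fang)  # f4(Test, San; Bubi, Fang)
```

    A larger outgroup f3 value indicates that the test population shares more drift with the Bubi since their split from the outgroup.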

    The fixation index (Fst) is a measure of population differentiation. We calculated the mean pairwise Fst values between all the populations present in the global dataset. All autosomal SNPs were included in this analysis using the approach of Cockerham and Weir integrated in Plink 1.9 [65].

    The reduced dataset was phased with SHAPEIT2 [66], using 500 states, 50 MCMC main steps, 10 burn-in and 10 pruning steps. Recombination maps were interpolated from the HapMap phase 2 genetic maps. After excluding all positions with at least one missing genotype, we ended up with a dataset of 491,203 variable positions with no missing data.

    We used CHROMOPAINTER to build a coancestry matrix based on haplotype data from the phased reduced dataset. This software estimates the admixture proportions in recipient chromosomes by painting the proportion of each genetic component from the donor populations. We ran CHROMOPAINTER with linked data, estimating the n and M parameters through an observation run with no prefixed parameters that included 30 randomly selected samples and three randomly selected chromosomes. The fineSTRUCTURE analysis was performed with the counts obtained in CHROMOPAINTER and run with 1,000,000 Markov chain Monte Carlo (MCMC) iterations, with output printed every 10,000 iterations. The best tree was calculated with 10,000 state attempts. We also generated a haplotype-based PCA with fineSTRUCTURE.

    To identify any admixture events between Bubi ancestors and other populations during the last 4500 years, we used the GLOBETROTTER [41] software on the basis of the defined clusters from fineSTRUCTURE (Additional file 1: Table S4).

    Identity by descent (IBD) analysis

    Identity by descent (IBD) blocks are defined as identical chromosome fragments present in multiple individuals that have been inherited from the same ancestral chromosome [67]. We used the RefinedIBD software [68], setting “ibdcm” = 0.5, “ibdtrim” = 62, “ibdwindow” = 2478, and “overlap” = 413; the rest of the parameters were left at their defaults. All IBD blocks longer than five centimorgans (cM) were kept, and the statistical threshold given by the LOD score (the base 10 log of the likelihood ratio of the IBD segments, a figure that depends on the size of the database and the genetic diversity within it) was left at its default (> 3). The number of SNPs used here was 685,382. We then filtered the IBD segments to keep only those that were shared by any Bubi and another individual of the dataset (including the IBD fragments shared by two Bubi individuals). To reduce the impact that population size could have on the global counts of IBD blocks per population, we corrected the number of shared IBD fragments (IBDn) by the population size (t). To obtain the average number of IBD blocks shared by any Bubi with any other individual or population, we divided each number obtained in the previous step by the number of Bubi individuals, 13: ratioBubi_pop = (IBDn/t)/13.
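    The normalization at the end of this paragraph is a two-step division, illustrated here with hypothetical counts (the function name and the numbers are for illustration only).

```python
N_BUBI = 13  # number of Bubi individuals in the study


def ratio_bubi_pop(ibd_n, pop_size, n_bubi=N_BUBI):
    """Shared IBD blocks per individual of the other population and per Bubi individual."""
    return (ibd_n / pop_size) / n_bubi


# Hypothetical example: 260 blocks shared between the Bubi and a population of 20 individuals.
print(ratio_bubi_pop(260, 20))  # 260 / 20 = 13 blocks per individual, then 13 / 13 = 1.0
```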

    Runs of homozygosity (ROHs) analysis

    ROHs (> 1000 kb) were estimated with Plink software. First, we calculated, for each population, the average length (in kilobases) of the genome that is in homozygosity. Second, we calculated, for each population, the average number of genomic fragments that are in homozygosity.
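    Given a list of ROH segments per individual, the two population summaries can be computed as below. This is a sketch under the assumption that averages are taken over individuals within each population and that individuals without any qualifying ROH count as zero; the input layout and names are hypothetical, not Plink's output format.

```python
def roh_population_means(segments, population_of, min_kb=1000.0):
    """Per population: (mean total ROH length in kb, mean number of ROHs) per individual."""
    total_kb = {ind: 0.0 for ind in population_of}
    n_segments = {ind: 0 for ind in population_of}
    for ind, length_kb in segments:
        if length_kb > min_kb:  # keep only ROHs longer than 1000 kb
            total_kb[ind] += length_kb
            n_segments[ind] += 1
    members = {}
    for ind, pop in population_of.items():
        members.setdefault(pop, []).append(ind)
    return {
        pop: (
            sum(total_kb[i] for i in inds) / len(inds),
            sum(n_segments[i] for i in inds) / len(inds),
        )
        for pop, inds in members.items()
    }


# Hypothetical segments: (individual, ROH length in kb).
segments = [("b1", 1500.0), ("b1", 2500.0), ("b2", 800.0), ("y1", 1200.0)]
population_of = {"b1": "Bubi", "b2": "Bubi", "y1": "Yoruba"}
summary = roh_population_means(segments, population_of)
```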

    Malaria resistance

    Relevant mutations associated with malaria resistance in 10 different genes (Additional file 1: Table S11) – as found in genome-wide association studies (GWAS) and other previous studies [31, 69] – were genotyped in Bubi and the 1000 Genomes African populations. Fisher’s exact test was used to determine the statistical significance of the observed differences (p < 0.001).
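    For a 2x2 table of allele counts (for example, risk and non-risk alleles in the Bubi versus another population), a two-sided Fisher's exact test can be sketched with the standard library alone. The counts below are hypothetical, and the two-sided p-value follows the common convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # tolerance for floating-point ties
    )
    return min(1.0, p_value)


# Hypothetical allele counts: rows are populations, columns are alleles.
p = fisher_exact_two_sided(12, 14, 40, 320)
significant = p < 0.001
```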

    Evaluation of the effects of limited sample size

    We used a whole-genome Fst approach to evaluate the effects of the small sample size used in this work. We randomly grouped the 186 Yoruba individuals from 1000 Genomes into 14 subsamples of 13–17 individuals and estimated the mean pairwise Fst values among all population combinations. All autosomal SNPs were included in this analysis using the approach of Cockerham and Weir integrated in Plink 1.9 [65]. No comparison showed mean pairwise Fst values higher than 0.1, which indicates that the subsamples do not show significant differences in terms of genetic diversity (Additional file 2: Figure S12). This result suggests that the limited Bubi sample size can be used to infer genetic diversity at a higher population level.


    Cardiovascular disease (CVD), which ultimately damages heart muscle, is a leading cause of death worldwide (WHO, 2018). CVD encompasses a range of pathologies including myocardial infarction (MI), where ischemia or a lack of oxygen delivery to energy-demanding cardiomyocytes results in cellular stress, irreparable damage, and cell death. Genome-wide association studies (GWAS) have identified hundreds of loci associated with coronary artery disease (Nikpay et al., 2015), MI, and heart failure (Shah et al., 2020), indicating the potential contribution of specific genetic variants to disease risk. Most disease-associated loci do not localize within coding regions of the genome, often making inference about the molecular mechanisms of disease challenging. That said, because most GWAS loci fall within non-coding regions, these variants are thought to have a role in regulating gene expression. One of the main goals of the Genotype-Tissue Expression (GTEx) project has been to bridge the gap between genotype and organismal level phenotypes by identifying associations between genetic variants and intermediate molecular level phenotypes such as gene expression levels (GTEx Consortium et al., 2017). The GTEx project has identified tens of thousands of expression quantitative trait loci (eQTLs), namely variants that are associated with changes in gene expression levels, across dozens of tissues including ventricular and atrial samples from the heart. However, the eQTLs reported by GTEx explain a modest proportion of GWAS loci, and while increasing the diversity of tissues and sample sizes will enable further insight, orthogonal approaches also need to be considered.

    It is becoming increasingly evident that many genetic variants that are not associated with gene expression levels at steady state may be found to impact dynamic programs of gene expression in specific contexts. This includes specific developmental stages (Cuomo et al., 2020; Strober et al., 2019), or specific exposure to an environmental stimulus such as endoplasmic reticulum stress (Dombroski et al., 2010), hormone treatment (Maranville et al., 2011), radiation-induced cell death (Smirnov et al., 2012), vitamin D exposure (Kariuki et al., 2016), drug-induced cardiotoxicity (Knowles et al., 2018), and response to infection (Alasoo et al., 2018; Barreiro et al., 2012; Çalışkan et al., 2015; Kim-Hellmuth et al., 2017; Manry et al., 2017; Nédélec et al., 2016). The studies of context-specific dynamic eQTLs highlight the need to determine the effects of genetic variants in the relevant environment. Therefore, if we are to fully understand the effects of genetic variation on disease, we must assay disease-relevant cell types and disease-relevant perturbations. Most of the aforementioned studies were performed in whole blood or immune cells, which means that there are many cell types and disease-relevant states that have yet to be explored.

    With advances in pluripotent stem cell technology, we can now generate otherwise largely inaccessible human cell types through directed differentiation of induced pluripotent stem cells (iPSCs) reprogrammed from easily accessible tissues such as fibroblasts or B-cells. One of the advantages of iPSC-derived cell types as a model system is that the environment can be controlled, and thus we can specifically test for genetic effects on molecular phenotypes in response to controlled perturbation. This is particularly useful for studies of complex diseases such as CVD, which result from a combination of both genetic and environmental factors.

    The heart is a complex tissue consisting of multiple cell types, yet the bulk of the volume of the heart is comprised of cardiomyocytes (Donovan et al., 2019; Pinto et al., 2016), which are particularly susceptible to oxygen deprivation given their high metabolic activity. iPSC-derived cardiomyocytes (iPSC-CMs) have been shown to be a useful model for studying genetic effects on various cardiovascular traits and diseases, as well as for studying gene regulation (Banovich et al., 2018; Benaglio et al., 2019; Brodehl et al., 2019; Burridge et al., 2016; de la Roche et al., 2019; Ma et al., 2018; McDermott-Roe et al., 2019; Panopoulos et al., 2017; Pavlovic et al., 2018; Ward and Gilad, 2019).

    In humans, coronary artery disease can lead to MI (Dzau et al., 2006) which results in ischemia and a lack of oxygen delivery to energy-demanding cardiomyocytes. Given the inability of cardiomyocytes to regenerate, this cellular stress ultimately leads to tissue damage. Advances in treatment for MI, such as surgery to restore blood flow and oxygen to occluded arteries, have improved clinical outcomes. However, a rapid increase in oxygen levels post-MI can generate reactive oxygen species leading to ischemia-reperfusion (I/R) injury (Giordano, 2005). Both MI and I/R injury can thus ultimately influence the amount of damage in the heart. iPSC-CMs allow us to mimic the I/R injury process in vitro by manipulating the oxygen levels that cardiomyocytes are exposed to in vivo.

    We thus designed a study aimed at developing an understanding of the genetic determinants of the response to a universal cellular stress, oxygen deprivation, in a disease-relevant cell type, mimicking a disease-relevant process. To do so, we established an in vitro model of oxygen deprivation (hypoxia) and re-oxygenation in a panel of iPSC-CMs from 15 genotyped individuals (Banovich et al., 2018). We collected data for three molecular level phenotypes: gene expression, chromatin accessibility, and DNA methylation to understand both the genetic and regulatory responses to this cellular stress. This framework allowed us to identify eQTLs that are not evident at steady state, and assess their association with complex traits and disease.



    Regular fish and omega-3 consumption may have several health benefits and are recommended by major dietary guidelines. Yet, their intakes remain remarkably variable both within and across populations, which could partly owe to genetic influences.


    To identify common genetic variants that influence fish and dietary eicosapentaenoic acid plus docosahexaenoic acid (EPA+DHA) consumption.


    We conducted genome-wide association (GWA) meta-analysis of fish (n = 86,467) and EPA+DHA (n = 62,265) consumption in 17 cohorts of European descent from the CHARGE (Cohorts for Heart and Aging Research in Genomic Epidemiology) Consortium Nutrition Working Group. Results from cohort-specific GWA analyses (additive model) for fish and EPA+DHA consumption were adjusted for age, sex, energy intake, and population stratification, and meta-analyzed separately using fixed-effect meta-analysis with inverse variance weights (METAL software). Additionally, heritability was estimated in 2 cohorts.
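    Fixed-effect meta-analysis with inverse variance weights combines the per-cohort effect estimates for a SNP into a single estimate. The sketch below shows the arithmetic for one SNP under the usual normal approximation; the cohort estimates are hypothetical and the function is an illustration, not the METAL software itself.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance weighted fixed-effect estimate, its SE, z-score and two-sided p."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return beta, se, z, p_value


# Hypothetical per-cohort effects (servings/day per minor allele) and standard errors.
cohort_betas = [-0.031, -0.025, -0.040]
cohort_ses = [0.010, 0.008, 0.020]
beta, se, z, p = fixed_effect_meta(cohort_betas, cohort_ses)
print(f"beta = {beta:.4f}, SE = {se:.4f}, P = {p:.2e}")
```

    Cohorts with smaller standard errors (typically the larger cohorts) receive proportionally more weight in the pooled estimate.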


    Heritability estimates for fish and EPA+DHA consumption ranged from 0.13–0.24 and 0.12–0.22, respectively. A significant GWA for fish intake was observed for rs9502823 on chromosome 6: each copy of the minor allele (FreqA = 0.015) was associated with 0.029 servings/day (~1 serving/month) lower fish consumption (P = 1.96 × 10⁻⁸). No significant association was observed for EPA+DHA, although rs7206790 in the obesity-associated FTO gene was among the top hits (P = 8.18 × 10⁻⁷). Post-hoc calculations demonstrated 95% statistical power to detect a genetic variant associated with effect size of 0.05% for fish and 0.08% for EPA+DHA.


    These novel findings suggest that non-genetic personal and environmental factors are principal determinants of the remarkable variation in fish consumption, representing modifiable targets for increasing intakes among all individuals. Genes underlying the signal at rs72838923 and mechanisms for the association warrant further investigation.

    Citation: Mozaffarian D, Dashti HS, Wojczynski MK, Chu AY, Nettleton JA, Männistö S, et al. (2017) Genome-wide association meta-analysis of fish and EPA+DHA consumption in 17 US and European cohorts. PLoS ONE 12(12): e0186456.

    Editor: Philipp D. Koellinger, Vrije Universiteit Amsterdam, NETHERLANDS

    Received: July 9, 2015 Accepted: September 14, 2017 Published: December 13, 2017

    This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

    Data Availability: The meta-analysis results from this study are available at dbGAP (accession number phs000930).

    Funding: The Atherosclerosis Risk in Communities (ARIC) study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute contracts N01‐HC‐55015, N01‐HC‐55016, N01‐HC‐55018, N01‐HC‐55019, N01‐HC‐55020, N01‐HC‐55021, N01‐HC‐55022, R01HL087641, R01HL59367 and R01HL086694 National Human Genome Research Institute contract U01HG004402 and National Institutes of Health contract HHSN268200625226C. The authors thank the staff and participants of the ARIC study for their important contributions. Infrastructure was partly supported by Grant Number UL1RR025005, a component of the National Institutes of Health and NIH Roadmap for Medical Research. Dr. Nettleton was supported by a K01 from the National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases (5K01DK082729-02). The Cardiovascular Health Study (CHS) research reported in this article was supported by contracts HHSN268201200036C, HHSN268200800007C, N01HC55222, N01HC85079, N01HC85080, N01HC85081, N01HC85082, N01HC85083, N01HC85086, and grant U01HL080295 from the National Heart, Lung, and Blood Institute, with additional contribution from the National Institute of Neurological Disorders and Stroke. Additional support was provided by R01AG023629 from the National Institute on Aging. A full list of principal CHS investigators and institutions can be found at DNA handling and genotyping was supported in part by National Center for Research Resources grant M01RR00069 to the Cedars‐Sinai General Clinical Research Center Genotyping core and National Institute of Diabetes and Digestive and Kidney Diseases grant DK063491 to the Southern California Diabetes Endocrinology Research Center. Dr. Mozaffarian was supported by R01 HL085710 from the National Heart, Lung, and Blood Institute. 
The DILGOM study has been funded by the Academy of Finland (grant numbers 139635, 129494, 118065, 129322, 250207, 136895, 141005), the Orion-Farmos Research Foundation, the Finnish Foundation for Cardiovascular Research, and the Sigrid Jusélius Foundation. We thank the many colleagues who contributed to collection and phenotypic characterization of the clinical samples, and DNA extraction and genotyping of the data, especially Eija Hämäläinen, Minttu Jussila, Outi Törnwall, Päivi Laiho, and the staff from the Genotyping Facilities at the Wellcome Trust Sanger Institute. We would also like to acknowledge those who agreed to participate in the DILGOM study. EGCUT received financing by FP7 grants (278913, 306031, 313010), Center of Excellence in Genomics (EXCEGEN) and University of Tartu (SP1GVARENG). We acknowledge EGCUT technical personnel, especially Mr V. Soo and S. Smit. Data analyzes were carried out in part in the High Performance Computing Center of University of Tartu. The Family Heart Study (FamHS) work was supported in part by NIH grants 5R01 HL08770003, 5R01 HL08821502 (Michael A. Province) from NHLBI, and 5R01 DK07568102, 5R01 DK06833603 from NIDDK (Ingrid B. Borecki), and by the National Heart, Lung, and Blood Institute cooperative agreement grants U01 HL 67893, U01 HL67894, U01 HL67895, U01 HL67896, U01 HL67897, U01 HL67898, U01 HL67899, U01 HL67900, U01 HL67901, U01 HL67902, U01 HL56563, U01 HL56564, U01 HL56565, U01 HL56566, U01 HL56567, U01 HL56568, and U01 HL56569. The investigators thank the staff and participants of the FHS for their important contributions. The Framingham Offspring Study and Framingham Third Generation Study (FHS) were conducted in part using data and resources from the Framingham Heart Study of the National Heart Lung and Blood Institute of the National Institutes of Health and Boston University School of Medicine. 
The analyses reflect intellectual input and resource development from the Framingham Heart Study investigators participating in the SNP Health Association Resource (SHARe) project. This work was partially supported by the National Heart, Lung and Blood Institute's Framingham Heart Study (Contract No. N01‐HC‐25195) and its contract with Affymetrix, Inc. for genotyping services (Contract No. N02‐HL‐6‐4278). A portion of this research utilized the Linux Cluster for Genetic Analysis (LinGA‐II) funded by the Robert Dawson Evans Endowment of the Department of Medicine at Boston University School of Medicine and Boston Medical Center. Also supported by National Institute for Diabetes and Digestive and Kidney Diseases (NIDDK) R01 DK078616 to Drs. Meigs, Dupuis and Florez, NIDDK K24 DK080140 to Dr. Meigs, and a Massachusetts General Hospital Physician Scientist Development Award and a Doris Duke Charitable Foundation Clinical Scientist Development Award to Dr. Florez. Dr. Hivert was supported by the Centre de Recherche Medicale de l’Universite de Sherbrooke (CRMUS) and a Canadian Institute of Health Research (CHIR) Fellowships Health Professional Award. Dr. Nicola McKeown is supported by the USDA agreement No. 58-1950-7-707. We thank all study participants as well as everybody involved in the Helsinki Birth Cohort Study. Helsinki Birth Cohort Study has been supported by grants from the Academy of Finland, the Finnish Diabetes Research Society, Folkhälsan Research Foundation, Novo Nordisk Foundation, Finska Läkaresällskapet, Signe and Ane Gyllenberg Foundation, University of Helsinki, European Science Foundation (EUROSTRESS), Ministry of Education, Ahokas Foundation, Emil Aaltonen Foundation. The Health, Aging and Body Composition (Health ABC) study was supported in part by the Intramural Research Program of the NIH, National Institute on Aging contracts N01AG62101, N01AG62103, and N01AG62106. 
The genome-wide association study was funded by NIA grant R01 AG032098 to Wake Forest University Health Sciences and genotyping services were provided by the Center for Inherited Disease Research (CIDR). CIDR is fully funded through a federal contract from the National Institutes of Health to The Johns Hopkins University, contract number HHSN268200782096C. The use of Health 2000 data in this study has been financially supported by the Academy of Finland (grant 250207) and Orion-Farmos Research Foundation. The authors would like to thank the many colleagues who contributed to collection and phenotypic characterization of the clinical samples, and DNA extraction and genotyping of the data, especially Eija Hämäläinen, Minttu Jussila, Outi Törnwall, Päivi Laiho, and the staff from the Genotyping Facilities at the Wellcome Trust Sanger Institute. They would also like to acknowledge those who agreed to participate in the H2000 study. Invecchiare in Chianti (aging in the Chianti area, InCHIANTI) study investigators thank the Intramural Research Program of the NIH, National Institute on Aging who are responsible for the InCHIANTI samples. Investigators also thank the InCHIANTI participants. The InCHIANTI study baseline (1998‐2000) was supported as a “targeted project” (ICS110.1/RF97.71) by the Italian Ministry of Health and in part by the U.S. National Institute on Aging (Contracts: 263 MD 9164 and 263 MD 821336). The Multi-Ethnic Study of Atherosclerosis (MESA) and MESA SHARe project are conducted and supported by contracts N01-HC-95159 through N01-HC-95169 and RR-024156 from the National Heart, Lung, and Blood Institute (NHLBI). Funding for MESA SHARe genotyping was provided by NHLBI Contract N02‐HL‐6‐4278. The authors thank the participants of the MESA study, the Coordinating Center, MESA investigators, and study staff for their valuable contributions. 
A full list of participating MESA investigators and institutions can be found at The NHS and HPFS are supported by the National Cancer Institute (NHS: UM1 CA186107, HPFS: UM1 CA167552) with additional support for genotyping. The NHS Breast Cancer GW scan was performed as part of the Cancer Genetic Markers of Susceptibility initiative of the NCI (R01CA40356, U01-CA98233). The NHS/HPFS type 2 diabetes GWAS (U01HG004399) is a component of a collaborative project that includes 13 other GWAS funded as part of the Gene Environment-Association Studies (GENEVA) under the NIH Genes, Environment and Health Initiative (GEI) (U01HG004738, U01HG004422, U01HG004402, U01HG004729, U01HG004726, 01HG004735, U01HG004415, U01HG004436, U01HG004423, U01HG004728, AHG006033) with additional support from individual NIH Institutes (NIDCR: U01DE018993, U01DE018903; NIAAA: U10AA008401; NIDA: P01CA089392, 01DA013423; NCI: CA63464, CA54281, CA136792, Z01CP010200). Assistance with phenotype harmonization and genotype cleaning, as well as with general study coordination, was provided by the GENEVA Coordinating Center (U01HG004446). Genotyping was performed at the Broad Institute of MIT and Harvard, with funding support from the NIH GEI (U01HG04424), and Johns Hopkins University Center for Inherited Disease Research, with support from the NIH GEI (U01HG004438) and the NIH contract “High throughput genotyping for studying the genetic contributions to human disease” (HHSN268200782096C). The NHS/HPFS CHD GWAS was supported by Merck/Rosetta Research Laboratories, North Wales, PA. The NHS/HPFS Kidney GWAS was supported by NIDDK: 5P01DK070756. The generation and management of GWAS genotype data for the Rotterdam Study is supported by the Netherlands Organization of Scientific Research NWO Investments (nr.
175.010.2005.011, 911‐03‐012), the Research Institute for Diseases in the Elderly (014‐93‐015; RIDE2), EUROSPAN (European Special Populations Research Network; LSHG‐CT‐2006‐01947), the Netherlands Organization for Scientific Research (Pionier, 047.016.009, 047.017.043050‐060‐810), Erasmus Medical Center and the Centre for Medical Systems Biology (CMSB I and II and Grand National Genomics Initiative) of the Netherlands Genomics Initiative (NGI). The Rotterdam Study is further funded by Erasmus Medical Center and Erasmus University, Rotterdam, Netherlands Organization for the Health Research and Development (ZonMw), the Research Institute for Diseases in the Elderly (RIDE), the Ministry of Education, Culture and Science, the Ministry for Health, Welfare and Sports, the European Commission (DG XII), and the Municipality of Rotterdam. We thank Pascal Arp, Mila Jhamai, Dr Michael Moorhouse, Marijn Verkerk, and Sander Bervoets for their help in creating the GWAS database. The authors are grateful to the study participants, the staff from the Rotterdam Study and the participating general practitioners and pharmacists. The Hellenic study of Interactions between SNPs and Eating in Atherosclerosis Susceptibility (THISEAS) study thanks the Genotyping Facility at the Wellcome Trust Sanger Institute for typing the THISEAS samples and in particular Sarah Edkins and Cordelia Langford. PD is supported by the Wellcome Trust. The WGHS is supported by HL043851 and HL080467 from the National Heart, Lung, and Blood Institute and CA047988 from the National Cancer Institute, the Donald W. Reynolds Foundation and the Fondation Leducq, with collaborative scientific support and funding for genotyping provided by Amgen.
The Young Finns Study has been financially supported by the Academy of Finland: grants 126925, 121584, 124282, 129378 (Salve), 117787 (Gendi), and 41071 (Skidi), the Social Insurance Institution of Finland, Kuopio, Tampere and Turku University Hospital Medical Funds (grant 9M048 for TeLeht), Juho Vainio Foundation, Paavo Nurmi Foundation, Finnish Foundation of Cardiovascular Research and Finnish Cultural Foundation, Tampere Tuberculosis Foundation and Emil Aaltonen Foundation (T.L). The expert technical assistance in the statistical analyses by Irina Lisinen, Ville Aalto and Mika Helminen is gratefully acknowledged. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the other funders.

    Competing interests: Luc Djousse reports receiving investigator-initiated grants (omega-3 fatty acid studies) from NIH and Amarin Pharma, Inc., and is currently serving as an ad hoc consultant for Amarin Pharma, Inc. Bruce Psaty reports serving on the DSMB of a clinical trial for a device funded by the manufacturer (Zoll LifeCor) and on the Steering Committee for the Yale Open Data Access Project funded by Johnson & Johnson. Paul Ridker has received research grant funds from AstraZeneca, a manufacturer of a prescription fish oil product. Oscar Franco works in ErasmusAGE, a center for aging research across the life course funded by Nestlé Nutrition (Nestec Ltd.), Metagenics Inc., and AXA. Nestlé Nutrition (Nestec Ltd.), Metagenics Inc., and AXA had no role in the design and conduct of the study; the collection, management, analysis, and interpretation of the data; or the preparation, review, or approval of the manuscript. Dr. Mozaffarian reports ad hoc honoraria or consulting from Bunge, Haas Avocado Board, Nutrition Impact, Amarin, AstraZeneca, Boston Heart Diagnostics, GOED, and Life Sciences Research Organization, and serving on scientific advisory boards for Unilever North America and Elysium Health. Harvard University holds a patent, listing Dr. Mozaffarian as one of three co-inventors, for use of trans-palmitoleic acid to prevent and treat insulin resistance, type 2 diabetes, and related conditions. All other authors report no conflicts of interest. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

    Abbreviations: EPA, eicosapentaenoic acid; DHA, docosahexaenoic acid; LD, linkage disequilibrium; GWAS, genome-wide association study

    4. Discussion

    Machine learning (ML) algorithms have recently caught the scientific community’s attention because of their flexibility, ease of use, and ability to learn from the data provided [55,56]. Via ML, it has been possible to develop models to identify individuals more susceptible to developing common and rare diseases [58,59,60,61,62,63,67,88,89,90,91,92,93] and determine diverse phenotypic response profiles in infectious diseases [94,95,96]. Considering that ML- and computational-based models have the potential to overcome the limitations of current established clinical models for the diagnosis and follow-up of neurodegenerative diseases, including AD [97], here we studied the feasibility of ML algorithms for predicting Alzheimer’s disease age of onset (ADAOO) in individuals from the Paisa genetic isolate. We argue that these ML-based predictive models will improve our understanding of the disease and provide a more accurate and precise definition of the AD natural history landmarks.

    We previously identified protective ($\hat{\beta} > 0$; Table 1) and harmful ($\hat{\beta} < 0$; Table 1) ADAOO-modifying variants of significant effect in this community from whole-exome genotyping and whole-exome sequencing data [35,36] using linear-mixed effects models and some ML methods [77]. Thus, the presence of the APOE*E2 allele alone delays ADAOO up to 12 years in PSEN1 E280A mutation carriers. Furthermore, this same allele delays ADAOO up to 17 years when included in an AD oligogenic model (Table 1) [36]. Subsequent analysis led to the development of a classification tree using advanced recursive partitioning to determine whether individuals carrying this mutation would develop early-onset or late-onset familial AD [36]. Following a similar approach, our group was able to identify ADAOO modifier variants in individuals with sporadic AD (Table 1) [35].

    After evaluating several ML-based predictive algorithms for ADAOO in individuals suffering from the most aggressive form of AD (Figure 1 and Table 2) and in individuals with sporadic AD (Figure 2 and Table 3), we identified that the glmboost and glmnet algorithms perform best for predicting ADAOO in unseen data for each cohort, respectively. These ML-based predictive models showed promising results that can be easily extended to the clinical setting [98]. In particular, the glmboost algorithm in E280A PSEN1 AD yielded MAE values below 4% and RMSE values of 4 (Table 2), while the glmnet algorithm yielded MAE values below 1% and RMSE values ρ in sAD (Table 3), suggesting that predicting AOO in these cohorts is feasible. Using these ML-based ADAOO predictive models, AD diagnosis could be made earlier, and potential treatments could be provided long before symptoms begin to appear.
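As a point of reference for the glmnet results above, the elastic-net objective documented for the glmnet package with a Gaussian (continuous) response can be written as follows. It is included only as orientation: the text does not state which mixing parameter $\alpha$ or penalty strength $\lambda$ was selected, so the symbols are generic and are not the tuned values behind Table 3.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
+ \lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\Bigr]
```

Under this reading, $y_i$ would be the ADAOO of individual $i$ and $x_i$ the vector of demographic and genetic predictors; $\alpha = 1$ corresponds to the lasso and $\alpha = 0$ to ridge regression.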

    Analysis of variable importance shows that the most relevant ADAOO predictors in fAD are variants APOE-rs7412, FCRL5-rs16838748, GPR20-rs36092215, IFI16-rs62621173, AOAH-rs12701506, and PYNLIP-rs2682585 (Figure 1b and Figure 3a). Furthermore, protective variants APOE-rs7412, GPR20-rs36092215, and FCRL5-rs16838748 both have the largest effect on ADAOO and are the most important predictors of ADAOO, while variants TRIM22-rs12364019, IFI16-rs62621173, and AOAH-rs12701506 both have the most harmful effect on ADAOO and are among the most important predictors of ADAOO (Figure 4a). Comparing these results with those of previous models predicting AD status (early- vs. late-onset) [36] shows some discrepancies in how the genetic variants are ranked and the relevance of demographic information (i.e., sex and years of education) for predicting AD status. Although predicting AD status may be of interest in some clinical settings, the use of ML-based predictive algorithms for ADAOO is a step forward in both our understanding of the disease and our goal of providing timely clinical care to individuals from this community. While AD cannot be cured and there is no way to stop or slow its progression at the moment, our approach offers the possibility of treating symptoms several years before they begin to appear [4,99,100] under an individually tailored biomarker scheme that takes individual variability into account, rather than using a one-size-fits-all population average strategy [99,100,101]. Although our results can certainly be used to move AD research in this direction, it is also important to consider the legal implications and the preparation that health providers, neurologists, and centers specializing in AD and neurodegeneration must have in order to interpret these findings and provide proper counseling to patients and their families [102,103,104]. Another challenge in the years to come is to significantly reduce the misinformed conclusions produced by ML methods in the absence of clinical domain expertise [105]. In this regard, having a deep understanding of the clinical background in AD, how ML methods operate, and how the results can be interpreted and translated to the patient and their relatives is crucial [57].

    Variants GPR45-rs35946826 and MAGI3-rs61742849 both have a more harmful effect on ADAOO and are the most important predictors of ADAOO in individuals with sAD (Figure 4b). Interestingly, the harmful effect on ADAOO of variants MYCBPAP-rs61749930 and EBLN1-rs838759 differs from those of other variants, but their importance for predicting ADAOO is lower, while variants CHGB-rs236150 and WDR46-rs3130257 accelerate ADAOO and have higher variable importance (Figure 4b). Among protective genetic variants, the highest effect is produced by OPRM1-rs675026, followed by HERC6-rs7677237 and C3orf20-rs34230332, with the former being the less important. Intriguingly, variant C16orf96-rs17137138 is the most important ADAOO predictor despite its small effect (Figure 4b).

    In summary, here we explore the feasibility of ML algorithms for predicting ADAOO using demographic and genetic data in individuals from the world’s most extensive pedigree segregating a severe form of AD caused by a fully penetrant mutation in the PSEN1 gene and individuals with sAD inhabiting the same geographical region. Based on the RMSE, MAE, and $R^2$ performance measures, our results indicate that ML algorithms are a feasible and promising alternative for assessing ADAOO in these individuals. Interestingly, the most important predictors in these ML-based predictive models were genetic variants, which makes it possible to assess ADAOO at the individual level and opens new personalized medicine and predictive genomic alternatives for AD [98,99,100,101].
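To make the three performance measures named above concrete, the following minimal Python sketch computes MAE, RMSE, and the coefficient of determination from paired observed and predicted values. The function names and the four example ages are hypothetical and invented for illustration; they are not data from the cohorts, and the errors are on the absolute scale (years) rather than the percentage scale quoted in the text.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: the average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily than MAE."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical ages of onset in years (illustration only, not study data).
observed = [48.0, 52.0, 45.0, 60.0]
predicted = [50.0, 51.0, 47.0, 57.0]

print("MAE :", mae(observed, predicted))
print("RMSE:", rmse(observed, predicted))
print("R^2 :", r_squared(observed, predicted))
```

Reporting all three together is useful because MAE and RMSE are expressed in the units of the outcome, whereas $R^2$ is unitless and describes the share of variance explained.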

    Future studies should assess the ability of the ML-based predictive models for ADAOO presented herein with out-of-sample data (i.e., determine how close the model is to predicting ADAOO in a patient with known genetic data that was not part of our cohorts) and the development of ML-based models of disease progression [38,50,51,60]. Ultimately, these models could help us to provide an easy-to-use platform, with potential application in the clinical setting, to provide early and accurate estimates of ADAOO and the evolution of AD in individuals with a family history of the disease.
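The out-of-sample assessment proposed above can be sketched as a simple hold-out evaluation: fit on one subset of individuals, then measure the error only on individuals the model never saw. The Python sketch below is an assumption-laden illustration, not the pipeline used in this work; it swaps in a one-predictor least-squares line for the ML algorithms, and the names `fit_line`, `holdout_mae`, `allele_count`, and `onset_age`, as well as the synthetic data, are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def holdout_mae(xs, ys, test_fraction=0.25, seed=0):
    """Fit on a random training subset and return the MAE on the held-out rest."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test, train = indices[:n_test], indices[n_test:]
    intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
    errors = [abs(ys[i] - (intercept + slope * xs[i])) for i in test]
    return sum(errors) / len(errors)


# Synthetic data (illustration only): 0, 1, or 2 copies of a hypothetical
# protective allele, each copy delaying onset by an assumed 6 years plus noise.
noise = random.Random(1)
allele_count = [i % 3 for i in range(40)]
onset_age = [48.0 + 6.0 * g + noise.gauss(0.0, 2.0) for g in allele_count]

print("Held-out MAE (years):", holdout_mae(allele_count, onset_age))
```

A truly external validation would go one step further and replace the random split with individuals from outside the original cohorts, which is the scenario the paragraph above describes.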
