SNP coding for association analysis


I'm working on a project about detecting SNP association with a disease. As I understand it, a SNP is a single-nucleotide variation that occurs in more than 1% of the population. However, I couldn't connect this idea with the dataset in hand. Each row in my dataset represents a patient and the columns contain SNP information. For example:

ID  exm355  exm615
1   T_T     A_C
2   T_T     C_C
3   A_T     C_C

I have no idea why the SNP columns contain 2 nucleotides (T_T, A_T, A_C, C_C). Given the definition of a SNP, I thought each column should show only the variant nucleotide, or am I misunderstanding something? How should I interpret T_T or C_C, and how can I know which nucleotide is the variant and which is the common one in the population?

Thanks all


Each chromosome location which has been identified as being a SNP is a location at which more than one nucleotide occurs at appreciable frequencies in the general population. This means that there are two or more bases which can occur there, so a person's test must show which bases actually occur there in that person's genome. Since a person has both a paternal and a maternal chromosome of a chromosomal type (a pair of homologous chromosomes), the person has two instances of the SNP location and so two nucleotides to be detected and reported. Hence your dataset has two nucleotides for each SNP location for each patient.

Just from the reported base or the SNP name you can't tell which is more frequent in the population. If you need to know that, you must consult SNP frequency data from some other database. (The SNP names in your example, e.g. "exm355" are not familiar; usually SNPs have names like "rs1234567".)
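For association analysis, two-letter genotypes like these are commonly recoded as the number of copies of one allele (additive coding: 0, 1 or 2). The sketch below is a minimal illustration, assuming the `A_B` string format shown in the question. The function names are hypothetical, and the "minor" allele is estimated from the sample itself, which only approximates the population frequency a reference database would give.

```python
from collections import Counter

def allele_counts(genotypes):
    """Count each nucleotide across all genotypes in one SNP column."""
    counts = Counter()
    for g in genotypes:
        counts.update(g.split("_"))  # "A_T" -> ["A", "T"]
    return counts

def additive_coding(genotypes):
    """Recode genotypes as copies of the sample's minor allele (0, 1 or 2)."""
    counts = allele_counts(genotypes)
    # Least frequent allele in this sample; ties are broken alphabetically.
    minor = min(sorted(counts), key=lambda a: counts[a])
    return minor, [g.split("_").count(minor) for g in genotypes]

# The three patients from the example table.
exm355 = ["T_T", "T_T", "A_T"]
exm615 = ["A_C", "C_C", "C_C"]
print(additive_coding(exm355))
print(additive_coding(exm615))
```

With only three patients the sample frequencies are not meaningful; the point is the recoding, which turns each SNP column into a numeric predictor that a regression or trend test can use.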


A non-coding CRHR2 SNP rs255105, a cis-eQTL for a downstream lincRNA AC005154.6, is associated with heroin addiction

Dysregulation of the stress response is implicated in drug addiction; therefore, polymorphisms in stress-related genes may be involved in this disease. An analysis was performed to identify associations between variants in 11 stress-related genes, selected a priori, and heroin addiction. Two discovery samples of American subjects of European descent (EA, n = 601) and of African Americans (AA, n = 400) were analyzed separately. Ancestry was verified by principal component analysis. Final sets of 414 (EA) and 562 (AA) variants were analyzed after filtering of 846 high-quality variants. The main result was an association of a non-coding SNP rs255105 in the CRH (CRF) receptor 2 gene (CRHR2), in the discovery EA sample (Pnominal = .00006, OR = 2.1, 95% CI 1.4–3.1). The association signal remained significant after permutation-based multiple testing correction. The result was corroborated by an independent EA case sample (n = 364). Bioinformatics analysis revealed that SNP rs255105 is associated with the expression of a downstream long intergenic non-coding RNA (lincRNA) gene AC005154.6. AC005154.6 is highly expressed in the pituitary but its functions are unknown. LincRNAs have been previously associated with adaptive behavior, PTSD, and alcohol addiction. Further studies are warranted to corroborate the association results and to assess the potential relevance of this lincRNA to addiction and other stress-related disorders.
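To make the reported effect size concrete, the sketch below shows one common way an odds ratio and its 95% confidence interval can be computed from a 2x2 table of carriers and non-carriers (the Woolf, or Wald, interval on the log scale). This is an illustration only: the counts are hypothetical, the function name is invented here, and the study may well have used a different model (for example logistic regression with covariates).

```python
import math

def odds_ratio_ci(case_carriers, case_noncarriers,
                  ctrl_carriers, ctrl_noncarriers, z=1.96):
    """Odds ratio with a Wald (Woolf) confidence interval from a 2x2 table."""
    a, b, c, d = case_carriers, case_noncarriers, ctrl_carriers, ctrl_noncarriers
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, NOT taken from the study.
or_value, ci_low, ci_high = odds_ratio_ci(60, 40, 40, 60)
print(f"OR = {or_value:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}")
```

An interval that excludes 1, as in the reported 1.4–3.1, is what indicates a nominally significant association at the 5% level.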

Citation: Levran O, Correa da Rosa J, Randesi M, Rotrosen J, Adelson M, Kreek MJ (2018) A non-coding CRHR2 SNP rs255105, a cis-eQTL for a downstream lincRNA AC005154.6, is associated with heroin addiction. PLoS ONE 13(6): e0199951. https://doi.org/10.1371/journal.pone.0199951

Editor: Z. Carl Lin, Harvard Medical School, UNITED STATES

Received: May 22, 2018 Accepted: June 15, 2018 Published: June 28, 2018

Copyright: © 2018 Levran et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Genotype/phenotype data is available in dbGAP with accession number: phs001109.v1.p1. Additional relevant data are within the paper and its Supporting Information file.

Funding: This work was supported by The Dr. Miriam and Sheldon G. Adelson Medical Research Foundation, the National Institutes of Health—National Institute on Drug Abuse Research Grant P60-05130 (M.J.K.), the National Institutes of Health—National Institute on Drug Abuse Research Grant R01-12848 (M.J.K), and the National Institute of Health—National Center for Advancing Translational Sciences Grant UL1RR024143 (B. Coller). CTN-0051 was supported by several grants from the National Institutes of Health—National Institute on Drug Abuse—National Drug Abuse Treatment Clinical Trials Network (CTN): U10DA013046, UG1/U10DA013035, UG1/U10DA013034, U10DA013045, UG1/U10DA013720, UG1/U10DA013732, UG1/U10DA013714, UG1/U10DA015831, U10DA015833, HHSN271201200017C, and HHSN271201500065C. The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The data used for the analyses described in this manuscript were obtained from the GTEx Portal on 4/10/2018. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.



In Silico Analysis of Coding/Noncoding SNPs of Human RETN Gene and Characterization of Their Impact on Resistin Stability and Structure

Resistin (RETN) is a gene coding for a proinflammatory adipokine called resistin, secreted by macrophages in humans. Single nucleotide polymorphisms (SNPs) in RETN are linked to obesity and insulin resistance in various populations. Using dbSNP, 78 nonsynonymous SNPs (nsSNPs) were retrieved and tested on a PredictSNP 1.0 megaserver. Among these, 15 nsSNPs were predicted as highly deleterious and thus subjected to further analyses, such as conservation, posttranscriptional modifications, and stability. The 3D structure of human resistin was generated by homology modeling using Swiss model. Root-mean-square deviation (RMSD), hydrogen bonds (h-bonds), and interactions were estimated. Furthermore, UTRscan served to identify UTR functional SNPs. Among the 15 most deleterious nsSNPs, 13 were predicted to be highly conserved, including variants in posttranslational modification sites. Stability analysis predicted 9 nsSNPs (I32S, C51Y, G58E, G58R, C78S, G79C, W98C, C103G, and C104Y) which can decrease protein stability with at least three out of the four algorithms used in this study. These nsSNPs were chosen for structural analysis. Variants C51Y and C104Y showed the highest RMS deviations (1.137 Å and 1.308 Å, respectively), which were confirmed by the important decrease in total h-bonds. The analysis of hydrophobic and hydrophilic interactions showed important differences between the native protein and the 9 mutants, particularly I32S, G79C, and C104Y. Six SNPs in the 3′ UTR (rs920569876, rs74176247, rs1447199134, rs943234785, rs76346269, and rs78048640) were predicted to be implicated in the polyadenylation signal. This study revealed 9 highly deleterious SNPs located in the human RETN gene coding region and 6 SNPs within the 3′ UTR that may alter the protein structure. These SNPs merit analysis in functional studies to further elucidate their effect on metabolic phenotype occurrence.

1. Introduction

Understanding genomic variation is one of the major challenges of current genomics research, due to the enormous number of genetic variations in the human genome. Single nucleotide polymorphisms (SNPs) represent the most abundant genetic variations throughout the human genome, ranging between 3 and 5 million in each individual [1]. Most SNPs are neutral, but some contribute to disease predisposition by modifying protein function, or serve as genetic markers for finding nearby disease-causing mutations through genetic association studies and family-based studies [2]. These variants are also thought to influence the response to some drugs [3].

SNPs that change the encoded amino acids are called nonsynonymous single nucleotide polymorphisms (nsSNPs). Nonsynonymous SNPs, forming about half of all genetic changes related to human diseases, can influence resulting protein structure and/or function with either neutral or deleterious effects [4, 5].

Moreover, the study of noncoding DNA is also important because it contains the majority of reported SNPs in the human genome. Polymorphisms in the 5′ and 3′ untranslated regions (UTRs) are of major interest because they can affect gene expression and posttranscriptional and posttranslational activities and thus be of functional relevance [6, 7].

Resistin is a proinflammatory adipokine which belongs to the cysteine-rich C-terminal domain proteins called resistin-like molecules (RELMs) and mainly secreted by adipocytes in rodents and macrophages in humans [8, 9]. The gene encoding resistin (RETN) is located on chromosome 19p13.2. It was shown that resistin is linked to several inflammatory disorders including obesity, type 2 diabetes, cardiovascular disease, and asthma [10–13]. This protein has effects which antagonize insulin action. Some studies have shown that resistin affects glucose transport and causes insulin-stimulated insulin receptor substrate-1 (IRS-1) degradation leading to insulin resistance induction [14–16]. Circulating resistin levels were reported to be significantly increased in both genetically and diet-induced obese mice and decreased with the administration of the antidiabetic drug Rosiglitazone [8].

Moreover, a case-control study on type 1 diabetes mellitus patients showed that the combination of insulin and Rosiglitazone decreased resistin and leptin levels significantly [17]. Genetic variants in RETN showed a significant association with circulating resistin levels. Beckers et al. identified the first missense mutation C78S in resistin in a morbidly obese proband and his obese mother. This finding encourages the study of variants in the RETN gene coding region to elucidate their involvement in pathogenesis [18]. It was estimated that genetic factors can explain up to 70% of the variation in circulating resistin levels [19]. However, analyses of the association between SNPs of the RETN gene and anthropometric variables and alterations related to obesity revealed inconsistent results [10, 20–23].

Based on the importance of the RETN gene in multiple inflammatory diseases, particularly metabolic abnormalities, we conducted a computational analysis using nsSNP effect predictors such as SIFT, PolyPhen, PANTHER, PhD-SNP and PredictSNP. The most deleterious nsSNPs were further analyzed by conservation and stability tools. Finally, a structural analysis was conducted in order to identify the most functionally deleterious SNPs in coding and untranslated regions.

2. Material and Methods

2.1. Dataset Collection

The SNP information of the RETN gene was collected from dbSNP (http://www.ncbi.nlm.nih.gov/snp/). The amino acid sequence of the protein (NCBI accession: NP_001180303) was retrieved from the NCBI protein database (http://www.ncbi.nlm.nih.gov/protein). The theoretical structure of resistin (PDB ID: 1LV6) was abandoned since it was not in agreement with the crystal structure now available for mouse resistin.

2.2. Prediction of Deleterious nsSNPs

PredictSNP1.0 (http://loschmidt.chemi.muni.cz/predictsnp1/) [24] was used as the predictor of the SNP effect on protein function. This resource is a consensus classifier that enables access to the nine best performing prediction tools: SIFT, PolyPhen-1, PolyPhen-2, MAPP, PhD-SNP, SNAP, PANTHER, PredictSNP, and nsSNPAnalyzer.

SIFT (Sorting Intolerant from Tolerant) predicts whether an amino acid substitution affects the protein function based on sequence homology and the physical properties of amino acids [25]. SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions in every position of the query sequence. PolyPhen-1 uses expert set of empirical rules to predict possible impact of amino acid substitutions, while PolyPhen-2 (Polymorphism Phenotyping v2) predicts the potential effect of an amino acid substitution on the structure and function of a human protein using multiple sequence alignment and structural information. MAPP (Multivariate Analysis of Protein Polymorphism) analyzes the physicochemical variation present in each column of a protein sequence alignment and predicts the impact of amino acid substitutions on the protein function [26]. PhD-SNP (Predictor of human Deleterious Single Nucleotide Polymorphisms) is a support vector machine- (SVM-) based predictor used to classify nsSNPs into human genetic disease-causing or benign mutations [27]. SNAP (screening for nonacceptable polymorphisms) is a neural network-based method used to predict functional effects of nonsynonymous SNPs using in silico derived protein information [28]. PANTHER (Protein Analysis Through Evolutionary Relationships) estimates the likelihood of a particular nsSNP to cause a functional effect on the protein using position-specific evolutionary preservation [29]. nsSNPAnalyzer uses a machine learning method called random forest to predict whether the nsSNP has a phenotypic effect [30] based on multiple sequence alignment and 3D structure information. Finally, PredictSNP1.0 displays the confidence scores generated by each tool and a consensus prediction as percentages by using their observed accuracy values to simplify comparisons [24].
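The idea of combining several tools into one consensus call can be sketched as a confidence-weighted vote. The code below is a generic illustration of that idea and is not PredictSNP's actual formula; the function name, the label strings, and the example confidences are all assumptions made for this sketch.

```python
def weighted_consensus(predictions):
    """Confidence-weighted vote over per-tool calls.

    predictions maps a tool name to (label, confidence), where label is
    'deleterious' or 'neutral' and confidence is a weight in [0, 1].
    Returns (consensus_label, strength) with strength in [0, 1].
    """
    score = 0.0
    total = 0.0
    for label, confidence in predictions.values():
        sign = 1.0 if label == "deleterious" else -1.0
        score += sign * confidence
        total += confidence
    if total == 0:
        return "unknown", 0.0
    normalized = score / total  # ranges from -1 (neutral) to +1 (deleterious)
    label = "deleterious" if normalized > 0 else "neutral"
    return label, abs(normalized)

# Hypothetical tool outputs for a single substitution.
calls = {
    "SIFT": ("deleterious", 0.8),
    "MAPP": ("neutral", 0.6),
    "SNAP": ("deleterious", 0.7),
}
print(weighted_consensus(calls))
```

Weighting by confidence means that a single low-confidence dissenting tool does not overturn two confident calls, which is the practical benefit of a consensus classifier over a simple majority.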

2.3. Sequence Conservation

A ConSurf web server (http://consurf.tau.ac.il/) was used to analyze amino acid sequence conservation. This web-based algorithm predicts the crucial functional regions of a protein by estimating the degree of amino acid conservation based on multiple sequence alignment. The grade range from 1 to 9 estimates the extent of conservation of the amino acid throughout evolution. Therefore, grade 9 represents the most highly conserved residue, and the numbers descend to 1 representing the least conserved region. This tool analyzes the conservation at the nucleotide and amino acid levels.

2.4. Prediction of Posttranslational Modification Sites

The ModPred web server (http://www.modpred.org/) was used to predict posttranslational modification (PTM) sites. The server consists of a set of bootstrapped logistic regression models for each type of PTM, trained on 126,036 nonredundant, experimentally verified PTM sites retrieved from the literature and from databases [31]. Results are given as residue, modification, score, confidence, and remarks. In this study, only medium and high confidence PTMs were taken into consideration.

2.5. Prediction of Change in Protein Stability

The change in protein stability due to nsSNPs was predicted using I-Mutant2.0 (http://folding.biofold.org/cgi-bin/i-mutant2.0), which is a support vector machine (SVM) web-based tool used for the automatic prediction of changes in protein stability due to SNP. It provides the predicted free energy change value (DDG) and the sign of the prediction as increase or decrease. DDG value is calculated from the unfolding Gibbs free energy value of the mutated protein minus the unfolding Gibbs free energy value of the wild type in kcal/mol.

DDG > 0 means that the protein stability increased, and DDG < 0 means that the protein stability decreased [32].

The stability was also checked by a MUpro tool (http://mupro.proteomics.ics.uci.edu/). This server is based on two machine learning methods: support vector machines and neural networks. Both of them were trained on a large mutation dataset and showed accuracy above 84%.

This server calculates a score between -1 and 1 as the confidence of prediction. A confidence score < 0 indicates that the mutation decreases the protein stability, while a confidence score > 0 means that the mutation increases the protein stability [33].
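The selection logic implied by this section can be sketched in a few lines, assuming the usual sign convention for both tools (a negative I-Mutant DDG or a negative MUpro score means decreased stability). The function names, variant labels, and score values are hypothetical placeholders, not results from the study.

```python
def stability_call(score):
    """Map a signed stability score to a call (negative means destabilizing)."""
    if score > 0:
        return "increase"
    if score < 0:
        return "decrease"
    return "neutral"

def decreasing_in_all(predictions):
    """Return the variants that every tool predicts to decrease stability."""
    return [
        variant
        for variant, scores in predictions.items()
        if all(stability_call(s) == "decrease" for s in scores.values())
    ]

# Placeholder variants with made-up scores, for illustration only.
example = {
    "VAR_A": {"imutant_ddg": -1.2, "mupro": -0.8},
    "VAR_B": {"imutant_ddg": 0.3, "mupro": -0.1},
}
print(decreasing_in_all(example))
```

Requiring agreement between tools trades sensitivity for specificity: a variant flagged by only one predictor is dropped from the downstream structural analysis.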

2.6. Scanning of UTR SNPs in the UTR Site

The 5′ and 3′ untranslated regions (UTRs) have crucial roles in degradation, translation, and localization of mRNAs as well as the regulation of protein-protein interaction. We used the UTRscan web server (http://itbtools.ba.itb.cnr.it/utrscan) to predict the functional SNPs in the 5′ and 3′ UTRs. The UTRscan tool allows the enquirer to search user-submitted sequences for any of the motifs present in UTRsite. UTRsite derives data from UTRdb, a curated database that updates UTR datasets through primary data mining and experimental validation [7, 34]. To perform this analysis, the primary FASTA format data was submitted and the results were shown in the form of signal names and their positions in the transcript.

2.7. Structural Analysis
2.7.1. Modeling of Native and Mutant Structure

The transcript with the reference sequence NP_001180303.1 was used for the homology modeling. We selected the X-ray crystal structure of Mus musculus resistin from the Protein Data Bank (PDB) with PDB code 1RGX [9] as a template to generate a human resistin by homology modeling using the Swiss model platform (https://swissmodel.expasy.org). The model has a QMEAN of -1.83 and a sequence identity of 55.56% (Figure 1).

UCSF Chimera was used to confirm the corresponding positions of the SNPs and to construct the 15 mutant models [35]. It is a highly extensible program developed by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco, for interactive visualization and analysis of molecular structures and related data.

The energy minimization of the wild type and mutant structures was performed with the NOMAD-Ref server, using Gromacs as the default force field. We used the conjugate gradient method for the 3D structure optimization [36].

2.7.2. RMSD and Total Hydrogen Bond Prediction

UCSF Chimera served again to check RMS deviation by superimposing both native and mutant structures. In addition, this tool served to calculate total h-bond values for each structure.
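For reference, the RMS deviation between two already-superimposed structures is the square root of the mean squared distance between matched atoms. The sketch below is a minimal pure-Python version of that formula, assuming the two coordinate lists are in the same atom order and have already been superimposed (Chimera performs the superposition itself); the function name and toy coordinates are illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumes atoms are matched by index and the structures are already
    superimposed; no alignment is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Toy coordinates: every atom is shifted by 1 Å along z.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
mutant = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(native, mutant))
```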

2.7.3. Interaction Analysis

COCOMAPS (bioCOmplexes COntact MAPS) is a web application to effectively analyze and visualize the interface in biological protein-protein complexes by making use of intermolecular contact maps. The input file was the resistin homology model in PDB format. In our study, we used COCOMAPS to analyze the interaction between the three monomers of resistin protein [37]. To achieve this, we uploaded the PDB file of resistin trimer (A, B, and C as chain IDs for each monomer) and we then compared the interaction interfaces between the two chains A and B considered as Molecule 1 interacting with the third chain C considered as Molecule 2 (interactions include residues from chain A and from chain B together interacting with chain C).

2.7.4. Prediction of Protein-Protein Interactions

STRING (Search Tool for the Retrieval of Interacting Genes/Proteins, available at http://string-db.org) is a database of known and predicted protein interactions, which currently covers 9,643,763 proteins from 2031 organisms. This database provides a critical assessment and integration of protein-protein interactions including direct (physical) and indirect (functional) associations [38].

3. Results

3.1. SNP Datasets

The RETN SNP data investigated in this work was retrieved in early October 2018 from the dbSNP database (http://www.ncbi.nlm.nih.gov/snp/?term=RETN). It contained a total of 1075 SNPs, of which 78 were nsSNPs, 35 were coding synonymous SNPs, and 339 were located in the noncoding region, which comprises 18 SNPs in the 5′ UTR, 35 SNPs in the 3′ UTR, and 287 in the intronic region.

3.2. Prediction of Deleterious nsSNPs

A total of 78 nsSNPs were selected for our investigation. This SNP collection was analyzed with various in silico prediction tools to measure their effects on pathogenicity and to find disease-associated SNPs. All nsSNPs obtained from the SNP database were loaded into PredictSNP1.0, and all available integrated tools were selected for prediction. Fifteen nsSNPs were predicted as deleterious by all integrated tools, except for nsSNPAnalyzer and PANTHER, which did not give a prediction for any mutation. According to SNAP, a total of 38 nsSNPs out of 54 were predicted to be deleterious (70.37%), followed by MAPP with 37 deleterious nsSNPs (68.51%), PolyPhen-2 with 31 nsSNPs (57.40%), SIFT with 26 nsSNPs (48.15%), PolyPhen-1 with 25 nsSNPs (46.29%), and PhD-SNP with 18 nsSNPs (33.33%). The nsSNPs predicted as deleterious are listed in Table 1 with the expected accuracy and were selected for further analysis.

3.3. Analysis of Conservation

The results of ConSurf analysis showed that 13 deleterious missense SNPs are located in highly conserved regions, with conservation values ranging between 7 and 9, which suggests that these positions are important for resistin integrity. Among these, three residues were predicted to be exposed and functional and five others were predicted to be buried and structural, along with two buried residues and one exposed residue. Positions G71 (score: 4) and R84 (score: 6) fell outside this range: position 84 was predicted as moderately conserved and position 71 as a variable residue; therefore, they were not selected for structural analysis.

3.4. Prediction of Posttranslational Modification Sites

ModPred was used to predict posttranslational modification sites present within the human resistin protein. Only PTMs with high or medium confidence were discussed. In the native protein, position R84 was predicted as a site of ADP-ribosylation, W98 as a site of C-linked glycosylation or proteolytic cleavage, and C103 and C104 as disulfide linkage sites. After mutagenesis, C51 appeared as a site of amidation with the change of Cys to Tyr, while the position W98 changed to a disulfide linkage site with the change of Trp to Cys. Regarding the position C104, it was predicted that the change of Cys to Tyr conferred an amidation site with a high confidence. The results of ModPred are shown in Table 2.

3.5. The Impact of Predicted Deleterious Mutations on Resistin Protein Stability

We analyzed the 13 missense substitutions predicted as deleterious in the previous steps with the I-Mutant2.0 and MUpro web servers. nsSNPs predicted to decrease stability with both tools were selected for further structural analysis. The results are shown in Table 3.

3.6. Structural Analysis
3.6.1. Modeling of Human Resistin Structure

Using the X-ray crystal structure (1rgx) as a template, we modeled the 3D structure of native human resistin using the Swiss model web server. Figure 2 shows the generated model as a trimer with three monomers (A, B, and C). This trimer was used to construct the 9 mutant models of human resistin.

3.6.2. RMSD Difference and Total Hydrogen Bonds

The RMSD values associated with the 9 mutants are given in Table 4. As the RMSD value increases, the deviation between the native and mutant structures becomes larger, which may induce a change in protein activity. The C51Y and C104Y mutants showed the highest RMSD values (Figures 2(a) and 2(b)). In addition, total h-bonds were calculated to assess their contribution to the stability and folding of the native protein. All mutated structures revealed a change in total h-bonds compared to the native resistin, but the C104Y mutant showed a remarkable decrease, forming 254 h-bonds while the native structure formed 291. Moreover, visualization of the native structure showed that the C51 and C104 residues form a disulfide bond with each other (Figure 2(d)); the change of the cysteine carried on the alpha helix at these positions induces the breakage of the disulfide bridge (Figures 2(c) and 2(e)), which may disturb the protein structure.

3.6.3. Interaction Analysis

The interface contacts between the amino acids present within the resistin trimer were studied using COCOMAPS. Variation in the number of different types of interactions was observed between the native and the 9 mutant resistin structures; the results are given in Table 5.

The native complex formed 262 hydrophilic-hydrophilic interactions. The mutant complexes I32S, C51Y, G79C, and C104Y showed a significant increase in the number of hydrophilic-hydrophilic interactions, with 286, 266, 277, and 266 interactions, respectively, which indicates a reduction in the hydrophobicity of these mutant trimers. In addition, the mutant complex C103G showed a significant increase in the number of hydrophobic-hydrophobic interactions, indicating an increase in its hydrophobicity.

Moreover, we found that the C51Y mutant trimer interacts with only 75 residues of chain C forming the trimer complex while in the native complex, chain C interacts with 78 residues. This small deviation may disrupt resistin trimer formation.

3.6.4. Prediction of the Effect of SNPs Located in the UTR by a UTRscan Server

The UTRscan server was used to predict the effect of UTR SNPs on transcriptional motif. Six SNPs in the 3′ UTR, namely, rs920569876, rs74176247, rs1447199134, rs943234785, rs76346269, and rs78048640, were predicted to be in polyadenylation sites and thus may be responsible for pathological phenotypes. Results are given in Table 6.


RESULTS

Case study

To illustrate the performance of snpXplorer, we explored the most recent set of common SNPs associated with late-onset Alzheimer's disease (AD, N = 83 SNPs, Supplementary Table S1) (43). Using this dataset as a case study, we show the benefits of using snpXplorer in a typical scenario. Briefly, AD is the most prevalent type of dementia at old age, and is associated with a progressive loss of cognitive functions, ultimately leading to death. In its most common form (late-onset AD, with age at onset typically >65 years), the disease is estimated to be 60–80% heritable. With an attributable risk of ∼30%, genetic variants in the APOE gene represent the largest common genetic risk factor for AD. In addition to APOE, the genetic landscape of AD now counts 83 common variants that are associated with a slight modification of the risk of AD. Understanding the genes most likely involved in AD pathogenesis as well as the crucial biological pathways is warranted for the development of novel therapeutic strategies for AD patients.

We retrieved the list of AD-associated genetic variants from Table 1 of the preprint by Bellenguez et al. (43). This study represents the largest GWAS on AD performed to date, and resulted in 42 novel SNPs reaching genome-wide evidence of association with AD. The exploration section of snpXplorer can first be used to inspect the association statistics of the novel SNP associations in previous studies of the same trait (i.e. International Genomics of Alzheimer Project (IGAP) and family history of AD (proxy_AD)). Specifically, a suggestive degree of association in these regions is expected to be found in earlier studies. As expected, suggestive association signals were already observed for the novel SNPs, increasing the likelihood that these novel SNPs are true associations (Supplementary Figure S1).

After the first explorative analysis, we pasted the variant identifiers (rsIDs) in the annotation section of snpXplorer, specifying rsid as input type, Gene Ontology and Reactome as gene-sets for the enrichment analysis, and Blood as GTEx tissue for eQTL (i.e. the default value). The N = 83 variants were linked to a total of 162 genes, with N = 54 variants mapping to one gene, N = 12 variants mapping to two genes, N = 7 variants mapping to three genes, N = 2 variants mapping to four genes, N = 1 variant mapping to five genes, N = 4 variants mapping to six genes, and N = 1 variant each mapping to 7, 8 and 11 genes (Supplementary Figure S2). N = 10 variants were found to be coding variants, N = 31 variants were found to be eQTL and N = 42 variants were annotated based on their genomic position. These results are returned to the user in the form of a (human and machine-readable) table, but also in the form of a summary plot (Figure 2A and Supplementary Figure S2). These graphs not only inform the user about the effect of the SNPs of interest (for example, a direct consequence on the protein sequence in case of coding SNPs, or a regulatory effect in case of eQTLs or intergenic SNPs), but also suggest the presence of more complex regions: for example, Supplementary Figure S2B indicates the number of genes associated with each SNP, which normally increases for complex, gene-dense regions such as the HLA region or IGH region.
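The tallies in this paragraph can be checked with a short calculation. The sketch below assumes that the group of four variants maps to six genes each and that one variant each maps to 7, 8 and 11 genes, which is the reading under which the stated totals of 83 variants and 162 genes are reproduced; the variable names are illustrative.

```python
# genes per variant -> number of variants with that many genes
genes_per_variant = {1: 54, 2: 12, 3: 7, 4: 2, 5: 1, 6: 4, 7: 1, 8: 1, 11: 1}

total_variants = sum(genes_per_variant.values())
# Counts variant-gene links; equals the gene total if no gene is shared.
total_links = sum(n_genes * n_variants
                  for n_genes, n_variants in genes_per_variant.items())

# Annotation source of each variant, as stated in the text.
annotation_source = {"coding": 10, "eQTL": 31, "position": 42}

print(total_variants, total_links, sum(annotation_source.values()))
```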

Results of the functional annotation of N = 83 variants associated with Alzheimer's disease (AD). (A) The circular summary figure shows the type of annotation of each genetic variant used as input (coding, eQTL or annotated by their positions) as well as each variant's minor allele frequency and chromosomal distribution. (B) REVIGO plot, showing the remaining GO terms after removing redundancy based on a semantic similarity measure. The colour of each dot codes for the significance (the darker, the more significant), while the size of the dot codes for the number of similar terms removed from REVIGO. (C) Results of our term-based clustering approach. We used Lin as semantic similarity measure to calculate similarity between all GO terms. We then used ward-d2 as clustering algorithm, and a dynamic cut tree algorithm to highlight clusters. Finally, for each cluster we generated wordclouds of the most frequent words describing each cluster.

In order to prioritize candidate genes, the authors of the original publication integrated (i) eQTLs and colocalization (eQTL coloc) analyses combined with expression transcriptome-wide association studies (eTWAS) in AD-relevant brain regions, (ii) splicing quantitative trait loci (sQTLs) and colocalization (sQTL coloc) analyses combined with splicing transcriptome-wide association studies (sTWAS) in AD-relevant brain regions, and (iii) genetic-driven methylation as a biological mediator of genetic signals in blood (MetaMeth) (43). In order to compare the SNP-gene annotation of the original study with that of snpXplorer, we counted the total number of unique genes associated with the SNPs (i) in the original study (N = 97), (ii) using our annotation procedure (N = 136) and (iii) the intersection between these gene sets (N = 79). When doing so, we excluded regions mapping to the HLA-gene cluster and IGH-gene clusters (three SNPs in total) as the original study did not report gene names but rather HLA-cluster and IGH-cluster. Nevertheless, our annotation procedure correctly assigned HLA-related genes and IGH-related genes to these SNPs. The number of intersecting genes was significantly higher than what could be expected by chance (P = 0.03, based on the one-tailed P-value of a binomial test, Supplementary Table S2). For six SNPs, the gene annotated by our procedure did not match the gene assigned in the original study. Specifically, for 4/6 of these SNPs, we found significant eQTLs in blood (rs60755019 with ADCY10P1, rs7384878 with PILRB, STAG3L5P, PMS2P1, GIGYF1 and EPHB4 genes, rs56407236 with FAM157C gene, and rs2526377 with TRIM37 gene), while the original study reported the closest gene as the most likely gene (rs60755019 with TREML2 gene, rs7384878 with SPDYE3 gene, rs56407236 with PRDM7 gene and rs2526377 with TSOAP1 gene).
In addition, we annotated SNPs rs76928645 and rs139643391 to SEC61G and WDR12 genes (closest genes), while the original study, using eQTL and TWAS in AD-relevant brain regions, annotated these SNPs to EGFR and ICA1L/CARF genes. While the latter two SNPs were likely mis-annotated in our procedure (due to specific datasets used for the annotation), our annotation of the former four SNPs seemed robust, and further studies will have to clarify the annotation of these SNPs.
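The gene-set comparison above can be illustrated with a short sketch that computes the fraction of shared genes and a one-tailed binomial P-value. The null probability used by the authors is not stated in this text, so `binom_upper_tail` is shown only on a toy example, and any value passed as `p` for the real data would be an assumption.

```python
from math import comb

def binom_upper_tail(k: int, n: int, p: float) -> float:
    """One-tailed binomial P-value: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Gene counts reported in the text.
n_original, n_snpxplorer, n_shared = 97, 136, 79

frac_of_original = n_shared / n_original      # share of original-study genes recovered
frac_of_snpxplorer = n_shared / n_snpxplorer  # share of snpXplorer genes also in the original

print(round(frac_of_original, 3))    # 0.814
print(round(frac_of_snpxplorer, 3))  # 0.581

# Toy example only: probability of at least 8 successes in 10 fair trials.
print(binom_upper_tail(8, 10, 0.5))  # 0.0546875
```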

With the resulting list of input SNPs and (likely) associated genes, we probed the GWAS-Catalog and the datasets of structural variations for previously reported associations. We found a marked enrichment in the GWAS-Catalog for Alzheimer's disease, family history of Alzheimer's disease, and lipoprotein measurement (Supplementary Figure S3, Supplementary Tables S3 and S4). The results of this analysis are relevant to the user as they indicate other traits that were previously associated with the input SNPs. As such, they may suggest relationships between different traits; for example, in our case study they suggest the involvement of cholesterol and lipid metabolism in AD, a known relationship (44). Next, we searched for all structural variations in a region of 10 kb surrounding the input SNPs, and we found that for 39/83 SNPs, a larger structural variation was present in the vicinity (Supplementary Table S5), including the known VNTR (variable number of tandem repeats) in the ABCA7 gene (45), and the known CNV (copy number variation) in the CR1, HLA-DRA and PICALM genes (Supplementary Table S5) (46–48). This information may be particularly interesting for experimental researchers investigating the functional effect of SVs, and could be used to prioritize certain genomic regions. Because of the complex nature of large SVs, these regions have been largely unexplored; however, technological improvements now make it possible to accurately measure SV alleles.
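The search for structural variations around each SNP amounts to an interval-overlap query. The following is a minimal sketch, not the snpXplorer implementation: the `SV` record, the `svs_near_snp` helper and the toy coordinates are hypothetical, and the window is assumed to extend 10 kb on each side of the SNP (the text could also be read as a 10 kb window in total).

```python
from typing import List, NamedTuple

class SV(NamedTuple):
    chrom: str
    start: int
    end: int
    kind: str  # e.g. "VNTR" or "CNV"

def svs_near_snp(chrom: str, pos: int, svs: List[SV], flank: int = 10_000) -> List[SV]:
    """Return the SVs on `chrom` that overlap [pos - flank, pos + flank]."""
    lo, hi = pos - flank, pos + flank
    return [sv for sv in svs
            if sv.chrom == chrom and sv.start <= hi and sv.end >= lo]

# Toy data with made-up coordinates.
toy_svs = [
    SV("chr1", 1_000, 3_000, "CNV"),
    SV("chr1", 50_000, 60_000, "VNTR"),
    SV("chr2", 1_000, 3_000, "CNV"),
]
hits = svs_near_snp("chr1", 12_000, toy_svs)
print(hits)
```

For many SNPs and large SV catalogues, a sorted index or interval tree would be preferable to this linear scan.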

We then performed our (sampling-based) gene-set enrichment analysis using Gene Ontology Biological Processes (GO:BP, default setting) and Reactome as gene-set sources, and Blood as tissue for the eQTL analysis. After averaging P-values across the number of iterations, we found N = 132 significant pathways from Gene Ontology (FDR < 1%) and N = 4 significant pathways from Reactome (FDR < 10%) (Supplementary Figure S4 and Supplementary Table S6). To facilitate the interpretation of the gene-set enrichment results, we clustered the significantly enriched terms from Gene Ontology based on a semantic similarity measure using REVIGO (Figure 2B) and our term-based clustering approach (Figure 2C). Both methods are useful as they provide an overview of the most relevant biological processes associated with the input SNPs. Our clustering approach found five main clusters of GO terms (Figure 2C and Supplementary Figure S5). We generated wordclouds to guide the interpretation of the set of GO terms of each cluster (Figure 2C). The five clusters were characterized by (i) trafficking and migration at the level of immune cells, (ii) activation of immune response, (iii) organization and metabolic processes, (iv) beta-amyloid metabolism and (v) amyloid and neurofibrillary tangles formation and clearance (Figure 2C). All these processes are known from previous studies to occur in the pathogenesis of Alzheimer's disease (43, 44, 49, 50). We observed that clusters generated by REVIGO are more conservative (i.e. only terms with a high similarity degree were merged) as compared to our term-based clustering, which generates a higher-level overview. In the original study (Supplementary Table S15 from (43)), the most significant gene sets related to amyloid and tau metabolism, lipid metabolism and immunity.
In order to calculate the extent of term overlap between results from the original study and our approach, we calculated semantic similarity between all pairs of significantly enriched terms in both studies. In addition to showing pairwise similarities between all terms, this analysis also shows how the enriched terms in the original study relate to the clusters found using our term-based approach. We observed patterns of high similarity between the significant terms in both studies (Supplementary Figure S6). For example, terms in the ‘Activation of immune system’ and the ‘Beta-amyloid metabolism’ clusters (defined with our term-based approach) showed high similarities with specific subsets of terms from the original study. This was expected as these clusters represent the most established biological pathways associated with AD. The cluster ‘Trafficking of immune cells’ had high similarity with a specific subset of terms from the original study, yet we also observed similarities with the ‘Activation of immune system’ cluster, in agreement with the fact that these clusters were also relatively close in the tree structure (Figure 2C). Similarly, high similarities were observed between the ‘Beta-amyloid metabolism’ and the ‘Amyloid formation and clearance’ clusters. Finally, the ‘Metabolic processes’ cluster had a high degree of similarity with a specific subset of terms, but also with terms related to the ‘Activation of immune system’ cluster. Altogether, we showed that (i) enriched terms from the original study and our study had a high degree of similarity, and (ii) that the enriched terms of the original study resembled the structure of our clustering approach. The complete analysis of 83 genetic variants took about 30 minutes to complete.
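The step of averaging P-values across iterations and then applying an FDR threshold can be sketched as follows. This is an illustration under stated assumptions rather than the snpXplorer code: the text does not say which FDR procedure was used, so a Benjamini-Hochberg adjustment is assumed here, and the term names and P-values are made up.

```python
def mean_p_values(p_by_iteration):
    """Average each term's P-value over all sampling iterations."""
    n = len(p_by_iteration)
    terms = p_by_iteration[0].keys()
    return {t: sum(it[t] for it in p_by_iteration) / n for t in terms}

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values, keyed like the input."""
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ranked)
    adjusted = {}
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        term, p = ranked[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[term] = running_min
    return adjusted

# Two hypothetical iterations with hypothetical GO terms.
iterations = [
    {"GO:A": 0.001, "GO:B": 0.20, "GO:C": 0.03},
    {"GO:A": 0.003, "GO:B": 0.40, "GO:C": 0.05},
]
averaged = mean_p_values(iterations)
fdr = benjamini_hochberg(averaged)
significant = sorted(t for t, q in fdr.items() if q < 0.10)
print(significant)
```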


Conclusion

In summary, we uncover a hidden layer of human A-to-I editing SNP loci that are of functional importance, enriched in GWAS signals for autoimmune diseases, and subject to balancing selection. Various types of RNA editing, including A-to-I editing, alter sequence relative to the genome at the RNA level, thus providing a rich resource of RNA variants that potentially produce functionally altered genes. For some of the RNA variants that are beneficial under certain conditions, once the same type of mutation occurs at the DNA level, it may be selectively maintained and become the target of balancing selection. Therefore, we hypothesized that RNA editing, as exemplified in this study with A-to-I editing, may be an unrecognized type of the common target of balancing selection in various species.


OPINION article

Long non-coding RNAs (lncRNAs) are RNAs with more than 200 nucleotides and are mostly transcribed by RNA polymerase II from different regions across the genome. They are currently known as key regulators of cellular function through different mechanisms such as epigenetic regulation, miRNA sponging, and modulation of proteins and enzyme cofactors (Kurokawa, 2011; Nie et al., 2012; Flynn and Chang, 2014; Birgani et al., 2017; Marchese et al., 2017). In this way, they are implicated in developmental pathways (Amaral and Mattick, 2008). Different lncRNAs such as HOTAIR exert their roles by changing the chromatin states of the genome (Mercer and Mattick, 2013). Rinn et al. introduced HOTAIR as a spliced and polyadenylated RNA with 2,158 nucleotides (Hajjari et al., 2013). HOTAIR, as one of the featured lncRNAs, is located between HOXC11 and HOXC12 on chromosome 12q13.3. HOTAIR forms stem-loop structures which bind to the histone modification complexes lysine-specific demethylase 1 (LSD1) and Polycomb Repressive Complex 2 (PRC2) in order to recruit them to specific target genes. It has many targets, such as HOXD. In this way, PRC2 can repress the targeted genes, leading to increased growth, proliferation, survival, metastasis, invasion, and drug resistance in some cancer cells (Rinn et al., 2007; He et al., 2011; Davidovich et al., 2013; Hajjari et al., 2014; Martens-Uzunova et al., 2014; Zhao et al., 2014). Accordingly, different studies in recent years have indicated the dysregulation of HOTAIR in different types of cancers (Gupta et al., 2010; Kogo et al., 2011; Yang et al., 2011; Niinuma et al., 2012; Hajjari et al., 2013; Kim et al., 2013; Li et al., 2013).

Some recent reports indicate a role for HOTAIR SNPs, which make it a significant cancer susceptibility locus and confer a high risk for some cancers (Qi et al., 2016), such as breast (Bayram et al., 2015, 2016; Yan et al., 2015), gastric (Pan et al., 2016; Tian et al., 2016), cervical (Guo et al., 2016; Qiu et al., 2016), papillary thyroid carcinoma (Zhu et al., 2016), osteosarcoma (Zhou et al., 2016), prostate (Taheri et al., 2017), ovarian (Wu et al., 2016; Qiu et al., 2017), and colorectal cancers (Xue et al., 2014). This is an interesting point because these SNPs may affect gene expression, function, and regulators of the epigenome (Hajjari and Rahnama, 2017). Therefore, we think that more studies on these SNPs can reveal their potential as markers of progression and diagnosis of different cancers.

Figure 1 shows the locations of these SNPs within HOTAIR gene. Herein, we present different SNPs to highlight their potential for further studies.

Figure 1. Locations of different SNPs within HOTAIR gene and their association with different types of cancer (E: Exon, exons of HOTAIR, and HOXC12 are shown by green and red boxes). Genomic positions are based on the UCSC Genome browser on Human Dec. 2013 (GRCh38/hg38) assembly.

Some reports indicate an association between the HOTAIR rs12826786 SNP, which is located between HOTAIR and HOXC12, and an increased risk of conditions such as breast cancer (BC) (Bayram et al., 2016), gastric adenocarcinoma (GCA) (Guo et al., 2015), prostate cancer (PC), and benign prostate hyperplasia (BPH) (Taheri et al., 2017). For instance, women who carry this polymorphism have an increased risk of BC in both codominant and recessive inheritance models (Bayram et al., 2016). Given its location, this SNP appears to affect the regulation of the HOTAIR gene in the cell. The analysis of HOTAIR dysregulation and its correlation with this SNP can therefore be proposed for different types of cancers in different populations.

rs920778 is another polymorphism, located in the intronic enhancer of the HOTAIR gene. The TT genotype of this SNP has been found to affect gene expression and to increase the risk of various cancers (Bayram et al., 2015) such as gastric cancer (Pan et al., 2016), esophageal squamous cell carcinoma (Zhang et al., 2014), cervical cancer (Qiu et al., 2016), and papillary thyroid carcinoma (Zhu et al., 2016). In addition, the CC genotype of this SNP might be a cause of breast cancer in both codominant and recessive inheritance models (Bayram et al., 2015).

There are some studies reporting the association between the dysregulation of HOTAIR and rs920778. HOTAIR up-regulation has been suggested as a result of rs920778 in gastric cancer (Xu et al., 2013; Pan et al., 2016). Also, the aberrant expression of HOTAIR in esophageal squamous cell carcinoma seems to be the result of a specific allele of rs920778 (Gupta et al., 2010; Zhang et al., 2014; Dai et al., 2017). Furthermore, there is higher expression of HOTAIR in female papillary thyroid carcinoma tissues because of a specific genetic polymorphism of this gene (Zhu et al., 2016).

Another SNP, annotated as rs4759314, is located in a promoter region in one of the introns of HOTAIR. It is noteworthy that the AG/GG genotypes of rs4759314 were associated with gastric cancer risk. In co-dominant models, the effect on expression was greater in heterozygous patients carrying the G allele than in homozygous patients (Du et al., 2015). However, in a conflicting report, HOTAIR gene expression was found to be higher in ovarian cancer patients with the AG/AA genotypes of rs4759314 (Wu et al., 2016).

Another SNP located in the intronic region of HOTAIR is rs1899663. Due to its location in a putative regulatory element, it seems that this SNP can affect gene expression and regulation. There is some association between the HOTAIR rs1899663 T allele and BPH (benign prostate hyperplasia). Also, rs1899663 is associated with prostate cancer risk in co-dominant, dominant and recessive inheritance models. Researchers have reported that this SNP changes the binding affinity of the PAX-4, SPZ1, and ZFP281 transcription factors, which can alter the HOTAIR gene expression level (Taheri et al., 2017).
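The inheritance models mentioned in these studies correspond to different numeric codings of the same genotype. The following minimal sketch shows one common way to code a genotype string under each model; the `A_T`-style genotype format, the function names and the choice of risk allele are assumptions made for illustration and are not taken from the cited studies.

```python
def risk_allele_count(genotype: str, risk_allele: str) -> int:
    """Count copies of the risk allele in a genotype such as 'C_T' or 'TT'."""
    return genotype.replace("_", "").upper().count(risk_allele.upper())

def code_genotype(genotype: str, risk_allele: str, model: str = "additive"):
    """Code a genotype for association testing under a given inheritance model."""
    n = risk_allele_count(genotype, risk_allele)
    if model == "additive":    # 0, 1 or 2 copies of the risk allele
        return n
    if model == "dominant":    # one copy is enough
        return int(n >= 1)
    if model == "recessive":   # two copies are required
        return int(n == 2)
    if model == "codominant":  # separate indicators: (heterozygote, homozygote)
        return (int(n == 1), int(n == 2))
    raise ValueError(f"unknown model: {model}")

models = ("additive", "dominant", "recessive", "codominant")
for genotype in ("C_C", "C_T", "T_T"):
    print(genotype, [code_genotype(genotype, "T", m) for m in models])
```

Under this coding a heterozygote counts as exposed in the dominant model but not in the recessive one, which is why the same SNP can be significant under one model and not another.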

Among the SNPs in HOTAIR gene, one named “rs7958904” is an exonic polymorphism. So, it seems that HOTAIR rs7958904 polymorphism can affect the secondary structure of HOTAIR.

It is noteworthy that the CC genotype of HOTAIR rs7958904 has been reported to be associated with decreased risk of osteosarcoma (Zhou et al., 2016), EOC (Wu et al., 2016), and colorectal cancer (Xue et al., 2014). In a study of osteosarcoma patients classified by age, gender, and tumor location, it was shown that the CC genotype of HOTAIR rs7958904 can reduce osteosarcoma risk as well as the HOTAIR expression level (Zhou et al., 2016). However, cervical cancer patients with the CC genotype of this SNP had higher HOTAIR expression (Jin et al., 2017). Furthermore, with regard to the up-regulation of HOTAIR in lung cancer (Jiang et al., 2017), the SNP has been reported to be associated with chemotherapy response in lung cancer patients through an effect on HOTAIR function or expression (Xue et al., 2014; Gong et al., 2016).

HOTAIR is abnormally expressed in different human cancers. Different studies have revealed the cellular and molecular mechanisms in which HOTAIR is involved (Hajjari and Salavaty, 2015; Gong et al., 2016). Recently, some studies indicating the potential role of HOTAIR SNPs in cancer susceptibility have been published. However, these studies are mostly derived from Asian populations. Also, there are some conflicting results in this field of study. With regard to the importance of HOTAIR regulation and function, more experiments on different populations and ethnicities are expected to reveal the importance of HOTAIR polymorphisms. Other polymorphisms in the HOTAIR gene, such as indels and CNVs, may be considered in future. However, the association between these SNPs and the regulation/structure of HOTAIR has to be checked in various cancers. Also, we believe that whole genome sequencing projects can help to find the relation between the SNPs of this RNA and other SNPs in different cancers in future.


Hotelling's T(2) multilocus association test

IMPORTANT This command has been temporarily disabled

For disease-traits, PLINK provides support for a multilocus, genotype-based test using Hotelling's T2 (T-squared) statistic. The --set option should be used to specify which SNPs are to be grouped, as follows:

plink --file data --set mydata.set --T2

where mydata.set defines which SNPs are in which set (see this section for more information on defining sets).
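As an illustration of what a set definition might look like, the sketch below builds the text of a set file in Python. It assumes the PLINK 1.07 set-file layout (a set name, one SNP identifier per line, then END); the set names, SNP identifiers and the `format_set_file` helper are hypothetical.

```python
def format_set_file(sets):
    """Build set-file text: set name, one SNP per line, then END, per set."""
    blocks = []
    for name, snps in sets.items():
        blocks.append("\n".join([name, *snps, "END"]))
    return "\n\n".join(blocks) + "\n"

# Hypothetical sets and SNP identifiers.
example_sets = {
    "GENE_A": ["rs0000001", "rs0000002"],
    "GENE_B": ["rs0000003"],
}
set_text = format_set_file(example_sets)
print(set_text)

# To write the file used in the command above:
# with open("mydata.set", "w") as handle:
#     handle.write(set_text)
```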

This command will generate a file which contains the fields

HINT Use the --genedrop permutation to perform a family-based application of the Hotelling's T2 test.

This command can be used with all permutation methods (label-swapping or gene-dropping, adaptive or max(T)). In fact, the permutation test is based on 1-p in order to make the between set comparisons for the max(T) statistic more meaningful (as different sized sets would have F-statistics with different degrees of freedom otherwise). Using permutation will generate one of the following files: which contain the fields or, if --mperm was used, which contain the fields

Note that this test uses a simple approach to missing data: rather than case-wise deletion (removing an individual if they have at least one missing observation) we impute the mean allelic value. Although this retains power under most scenarios, it can also cause some bias when there are lots of missing data points. Using permutation is a good way around this issue.
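The idea of imputing the mean allelic value, as opposed to dropping an individual with any missing genotype, can be sketched for a single SNP as follows. This is not PLINK's internal code; the function name and the representation of missing genotypes as `None` are assumptions.

```python
def impute_mean_allele_count(counts):
    """Replace missing allele counts (None) with the mean of the observed counts."""
    observed = [c for c in counts if c is not None]
    if not observed:
        raise ValueError("no observed genotypes to compute a mean from")
    mean = sum(observed) / len(observed)
    return [mean if c is None else float(c) for c in counts]

# Allele counts (0, 1 or 2 copies) for five individuals, one of them missing.
snp_counts = [0, 1, None, 2, 1]
imputed = impute_mean_allele_count(snp_counts)
print(imputed)  # [0.0, 1.0, 1.0, 2.0, 1.0]
```

All five individuals are retained, but the imputed value pulls the data toward the mean, which is the source of the bias mentioned above when many genotypes are missing.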


Methods

Study populations

Two independent Australian Caucasian breast cancer case populations were available for our study: The Genomics Research Centre Breast Cancer (GRC-BC) population and part of the Griffith University-Cancer Council Queensland Breast Cancer Biobank (GU-CCQ BB). We conducted single nucleotide polymorphism genotyping in the GRC-BC population initially. This consisted of DNA samples from 173 breast cancer patients from South East Queensland and DNA samples from 187 healthy age and sex matched females with no personal and/or familial history of breast, ovarian or any other type of cancer collected at the Genomics Research Centre Clinic, Southport, with research approved by Griffith University’s Human Ethics Committee (Approval: MSC/07/08/HREC and PSY/01/11/HREC) and the Queensland University of Technology Human Research Ethics Committee (Approval: 1400000104). Breast cancer samples comprised prevalent breast cancer cases diagnosed previous to their inclusion in this study. All participants supplied informed written consent. Average age of test population was 57.52 years and 57 years for cases and controls respectively.

Further validation of genotyping results was performed on a subset of the GU-CCQ BB population. 679 DNA samples from breast cancer patients residing in Queensland with a diagnosis of invasive breast cancer confirmed histologically were used to validate genotyping of miR-SNPs. Patient samples had been collected by the Genomics Research Centre in collaboration with the Cancer Council of Queensland as part of a 5-year population-based longitudinal study since January 2010. Patients included in this study were between 33 and 80 years of age, with an average age of 60.16 and they were screened for personal and/or familial history of breast, ovarian or any other type of cancer. Control population for the GU-CCQ BB was established from 2 sources: The control group for this cohort was comprised of genotyping result data taken from 201 healthy females belonging to the phase 1 European population from the 1000Genomes project. Efforts were made to select a subgroup of individuals that were comparable to the case group in terms of age, ethnicity and sex [34].

Genomic DNA sample preparation from whole human blood

Genomic DNA was extracted from whole blood samples using a modified salting out method described previously [35, 36]. DNA samples were evaluated by spectrophotometry using the Thermo Scientific NanoDrop™ 8000 UV-Vis Spectrophotometer (Thermo Fisher Scientific Inc., Wilmington, DE. USA) to determine DNA yield and 260/280 ratios [37–39]. Samples with a reading below 1.7 for their 260/280 ratio were purified using an ethanol precipitation protocol to guarantee DNA sample purity [40].

MiRNA SNP selection

Figure 1 shows the selection process we followed to determine miRNA SNPs (miR-SNPs) that could be included in our study. Two datasets, “The whole miRNA-disease association data” and “The miRNA function set data” from the human miRNA disease database (HMDD) created by Lu et al. [41] and updated in January 2012, were used to select 8 diseases and/or pathological characteristics and 24 biological and/or cellular functions related to breast cancer (See Table 1). As shown in Fig. 1, we picked the 50 miRNA genes from each dataset that were present in the majority of selected features for inclusion in the following steps. This list was narrowed down to the 25 miRNA genes on each dataset with the strongest evidence in order to maximise the potential for identification of biologically relevant molecules using two main criteria: miRNAs involved in the largest number of selected features from each group followed by a literature search to confirm the number of publications showing significant relationships to cancer biology or the possession of known functional effects of polymorphisms within the miRNA itself. Following this, we chose 10 miRNA genes from the 25 genes on both lists, again prioritising by number of functions and publications, and conducted a search to identify SNPs using both the dbSNP database from The National Center for Biotechnology Information (NCBI) [42] and the 1000 Genomes project browser [43]. Final selection of SNPs was done using this algorithm: All microRNA-SNPs located inside the pre-miRNA gene were automatically included in the SNP selection. However, SNPs located outside of the pre-miRNA gene were assessed using the following criteria: miR-SNPs located up to 500 bp upstream or downstream from the pre-miRNA were automatically included in the SNP selection. On the other hand, SNPs located more than 500 bp from the 3’ or 5’ end were chosen only if they had a previously reported minor allele frequency higher than 5% in Caucasian populations.
As a result 56 microRNA SNPs were identified in this preliminary selection (Data not shown) (See Fig. 1).
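The final SNP selection rule described above can be expressed as a small decision function. This is a sketch of the stated criteria only; the function name, the coordinate convention (inclusive start and end on one chromosome) and the toy positions are assumptions.

```python
from typing import Optional

def include_mir_snp(snp_pos: int, pre_start: int, pre_end: int,
                    maf: Optional[float] = None,
                    flank: int = 500, maf_threshold: float = 0.05) -> bool:
    """Apply the selection rule: inside the pre-miRNA, within the flank, or common."""
    # Rule 1: SNPs inside the pre-miRNA gene are always included.
    if pre_start <= snp_pos <= pre_end:
        return True
    # Distance to the nearest end of the pre-miRNA.
    distance = pre_start - snp_pos if snp_pos < pre_start else snp_pos - pre_end
    # Rule 2: SNPs up to 500 bp upstream or downstream are always included.
    if distance <= flank:
        return True
    # Rule 3: more distant SNPs need a reported MAF above 5%.
    return maf is not None and maf > maf_threshold

# Toy pre-miRNA spanning positions 10,000 to 10,080.
print(include_mir_snp(10_040, 10_000, 10_080))            # True
print(include_mir_snp(9_700, 10_000, 10_080))             # True
print(include_mir_snp(8_000, 10_000, 10_080, maf=0.02))   # False
print(include_mir_snp(8_000, 10_000, 10_080, maf=0.12))   # True
```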

Fig. 1. MicroRNA SNP (miR-SNP) selection algorithm using the Human miRNA Disease Database (HMDD). This flow chart shows the workflow for selection of the preliminary miR-SNPs included in the genotyping study. Abbreviations: dbSNP, single nucleotide polymorphism database; MAF, minor allele frequency; miRNA, microRNA; NCBI, National Center for Biotechnology Information; SNP, single nucleotide polymorphism.

Primer design

Using the MassARRAY® Assay Design Suite v1.0 software (SEQUENOM Inc., San Diego, CA, USA) we were able to create a single multiplex PCR genotyping assay containing 24 miR-SNPs from our preliminary selection (See Table 2). We designed forward and reverse PCR primers and one iPLEX® (extension) primer and verified that the mass of extension primers differed by at least 30 Da among different SNPs and by 5 Da between alternative alleles of the same marker to achieve successful marker and allele identification by mass spectrometry analysis. Primers were manufactured by Integrated DNA Technologies (IDT®) Pte. Ltd. (Baulkham Hills, NSW 2153, Australia) and primer information is shown in Table 3.

Primary multiplex PCR

Genotyping was undertaken following the iPLEX™ GOLD genotyping protocol using the iPLEX® Gold Reagent Kit (SEQUENOM Inc., San Diego, CA, USA). Primer extension reactions were performed according to the instructions for the SEQUENOM linear adjustment method included in the iPLEX™ GOLD genotyping protocol (SEQUENOM Inc., San Diego, CA, USA). All reactions were performed using Applied Biosystems® MicroAmp® EnduraPlate™ Optical 96-Well Clear Reaction Plates with Barcode (Life Technologies Australia Pty Ltd., Mulgrave, VIC, Australia) and an Applied Biosystems® Veriti® 96-Well Thermal Cycler (Life Technologies Australia Pty Ltd., Mulgrave, VIC, Australia).

MALDI-TOF MS analysis and data analysis

A total of 12-16 nl of each iPLEX® reaction product were transferred onto a SpectroCHIP® II G96 (SEQUENOM Inc., San Diego, CA, USA) using SEQUENOM® MassARRAY® Nanodispenser (SEQUENOM Inc., San Diego, CA, USA). SpectroCHIP® analysis was carried out by SEQUENOM® MassArray® Analyzer 4 and the SpectroAcquire software Version 4.0 (SEQUENOM Inc., San Diego, CA, USA). Finally data analysis for genotype determination was done using the MassARRAY® Typer software version 4.0 (SEQUENOM Inc., San Diego, CA, USA). In order to confirm the genotypes obtained, randomly selected samples (5 each for case and control cohorts) from each genotype (n = 240) were validated by Sanger Sequencing to ensure accuracy of genotyping results. In all cases, the Sanger Sequencing confirmed the genotyping obtained using MassARRAY.

Statistical analysis

Statistical analysis of genotypes and alleles was conducted using Plink software version 1.07 (http://pngu.mgh.harvard.edu/purcell/plink/) [44]. The α for p-values was set at 0.05 to determine statistically significant association with breast cancer. Genotype and allele frequencies for each miRNA SNP in our case and control populations were established and we used Hardy-Weinberg equilibrium (HWE) to evaluate deviation between observed and expected frequencies for identification of unexpected population or genotyping biases [45, 46]. We performed Chi square analysis to evaluate differences in genotype and allele frequencies between cases and controls for each independent population [47]. Finally, we calculated odds ratios (OR) and 95% confidence intervals (CI) to assess disease risk.
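The quantities named in this section can be illustrated with textbook formulas. The sketch below is not the Plink implementation: it uses a one-degree-of-freedom chi-square statistic for HWE and a Woolf (log-based) 95% confidence interval for the odds ratio, both of which are assumptions about the exact methods, and all counts are made up.

```python
from math import exp, log, sqrt

def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic comparing observed genotype counts with HWE expectations.

    Assumes both alleles are observed (otherwise an expected count is zero).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and Woolf confidence interval for a 2x2 allele table.

    a, b: risk and other allele counts in cases; c, d: the same in controls.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

print(hwe_chi_square(25, 50, 25))  # 0.0, counts exactly at HWE proportions
or_value, ci_low, ci_high = odds_ratio_ci(20, 80, 10, 90)
print(or_value)  # 2.25
```

An interval that includes 1 indicates that the association is not significant at the corresponding α level.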


Author information

Affiliations

International Institute of Tropical Agriculture (IITA), Ibadan, 200001, Oyo State, Nigeria

Ismail Yusuf Rabbi, Siraj Ismail Kayondo, Muyideen Yusuf, Cynthia Idhigu Aghogho, Kayode Ogunpaimo, Ruth Uwugiaren, Ikpan Andrew Smith, Prasad Peteti, Afolabi Agbona, Elizabeth Parkes, Chiedozie Egesi & Peter Kulakow

Boyce Thompson Institute, Ithaca, NY, 14853, USA

National Root Crops Research Institute (NRCRI), PMB 7006, Umudike, 440221, Nigeria

Ezenwaka Lydia & Chiedozie Egesi

Global Development Department, College of Agriculture and Life Sciences, Cornell University, Ithaca, NY, 14850, USA

Section on Plant Breeding and Genetics, School of Integrative Plant Sciences, Cornell University, Ithaca, NY, 14850, USA

Marnin Wolfe & Jean-Luc Jannink

United States Department of Agriculture - Agriculture Research Service, Ithaca, NY, 14850, USA

Contributions

IYR, CE, JLJ, and PK conceived and designed the study; IYR, SIK, GB, AA, and MY performed analyses and wrote the manuscript; CE, EL, EP, MW, JLJ, and PK edited the manuscript; CA, KO, RU, ASI, and PP implemented field trials and generated and curated data; and PK provided overall coordination and leadership.

