Information

What is the best method to check for differential gene expression in large dataset? I want to identify aging differentially expressed genes


My data is RNAseq data and I want to know which all genes (which I am looking for) are differentially expressed with age. I have tried the Kruskal-Wallis test and fold change methods. Do you have any better suggestions?


A standard pipeline for DGE would be Salmon for (pseudo)mapping + DESeq2 for statistical analysis.

Salmon is one of a set of modern, fast and accurate mapping software. Requires a transcriptome to be defined.

DESeq2 is a mature R (Bioconductor) package which can handle reasonably complex designs, including time series, but not mixed models. The vignettes are pretty thorough. It can handle data from pseudomapping software as well as data generated using more traditional methods, e.g. Star + FeatureCounts.

Using a pre-made RNAseq statistical package is a much better idea than trying to do, e.g. a Poisson/Neg.bin GLM manually for each gene (even though ultimately this is what DESeq2 does).

(agree this question is a better fit for SE Bioinf)


Background

Large-scale statistical analyses have become hallmarks of post-genomic era biological research due to advances in high-throughput assays and the integration of large biological databases. One accompanying issue is the simultaneous estimation of p-values for a large number of hypothesis tests. In many applications, a parametric assumption in the null distribution such as normality may be unreasonable, and resampling-based p-values are the preferred procedure for establishing statistical significance. Using resampling-based procedures for multiple testing is computationally intensive and typically requires large numbers of resamples.

Results

We present a new approach to more efficiently assign resamples (such as bootstrap samples or permutations) within a nonparametric multiple testing framework. We formulated a Bayesian-inspired approach to this problem, and devised an algorithm that adapts the assignment of resamples iteratively with negligible space and running time overhead. In two experimental studies, a breast cancer microarray dataset and a genome wide association study dataset for Parkinson's disease, we demonstrated that our differential allocation procedure is substantially more accurate compared to the traditional uniform resample allocation.

Conclusion

Our experiments demonstrate that using a more sophisticated allocation strategy can improve our inference for hypothesis testing without a drastic increase in the amount of computation on randomized data. Moreover, we gain more improvement in efficiency when the number of tests is large. R code for our algorithm and the shortcut method are available at http://people.pcbi.upenn.edu/


Experimental data

The data used in this workflow is an RNA-Seq experiment of airway smooth muscle cells treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids are used, for example, in asthma patients to prevent or reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. The reference for the experiment is:

Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. &ldquoRNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.&rdquo PLoS One. 2014 Jun 139(6):e99625. PMID: 24926665. GEO: GSE52778.


Materials and Methods

RNA sequencing data

Level 3 de-identified data for prostate cancer samples and all available non-malignant samples from these prostate cancer patients was downloaded from The Cancer Genome Atlas (TCGA) data portal (https://tcga-data.nci.nih.gov). Level 3 describes data that has been processed and aggregated to give gene expression signals for a sample. For each sample, the data contains expression counts for up to 20,531 coding and non-coding RNA transcripts plus clinical information such as age, stage, Gleason score, PSA level, and race/ethnicity. Before analysis, tumor and non-malignant samples were randomly pulled to achieve an age- and stage-matched pool of 225 samples (S1 Table). A total of 173 prostate cancer samples and 52 non-malignant samples from 204 unique patients were analyzed. The patient clinical information is presented in Table 1.

Differential Gene Expression

The R programming environment (version 3.1.2) [61] was used to process raw data, perform statistical calculations, and perform differential expression analysis. After age- and stage-matching, 393 transcripts were removed because they lacked expression in the 225 samples comprising the dataset. The RNA counts for the remaining 20,138 transcripts were rounded to the nearest whole number and compiled into a matrix to build the dataset. The magnitude of expression changes relative to non-malignant samples was also calculated by taking the base 2 logarithm of the tumor/non-malignant mean expression ratio. For genes with no expression in either the tumor or non-malignant samples, the log2 fold changes were adjusted by adding one to each mean and then calculating the ratio. All log2 values quoted are values after any such adjustments. Negative fold changes indicated down-regulation in tumor samples whereas positive values indicated up-regulation. The R package DESeq2 (version 1.6.3) [62] was used to identify DEGs in the TCGA patient RNA data. The computing was done on the Florida State University High Performance Computing Cluster. DESeq2 returned a P-value determined by Wald statistics and an adjusted P-value (Q-value) to correct for multiple comparisons testing using the Benjamini-Hochberg method to determine the false discovery rate (FDR). DEGs were defined as genes different with a FDR less than 1% (Q < 0.01).

To evaluate the significance of the identified DEGs, analyses were conducted to search for overrepresented pathways, gene set enrichment, and signaling pathway impact. First, overrepresented elements were identified among the DEGs. The Protein ANalysis THrough Evolutionary Relationships (PANTHER) Classification System and analysis tools were used to categorize DEGs by PANTHER protein class, Gene Ontology (GO) Molecular Function, and GO Biological Process to then determine if any of these classes or GO terms were overrepresented [63]. The PANTHER Overrepresentation Test (release 20150430) was used to search the data against the PANTHER database (PANTHER version 10.0 Released 2015-05-15) and the GO database (Released 2015-05-09) to identify either protein classes or GO annotations overrepresented in our data when compared to a reference human genome. P-values were adjusted using a Bonferroni correction.

Pathway Analysis

Gene Set Enrichment Analysis (GSEA) [64] was used to identify groups of genes enriched in either the tumor or non-malignant condition. The GSEA analysis tool (version 2.2.0) was downloaded from the Broad Institute website (http://www.broadinstitute.org/gsea/index.jsp). Curated gene sets of BioCarta and Reactome pathways were downloaded from the Broad Institute’s Molecular Signatures Database. An additional gene set was constructed from Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [65]. Pathways with the least relevance to prostate cancer were excluded. The KEGG pathways included in the analysis are listed in the Supporting Information (S2 Table). The entire RNA expression count matrix was loaded into the GSEA application without limiting the input to only DEGs. Both small (< 5 genes) and large (> 500 genes) gene sets were excluded from the analysis.

Signaling Pathway Impact Analysis (SPIA) was used to assess the importance of enriched pathways in terms of their impact and ability to activate or inhibit a pathway [66]. SPIA analysis was accomplished using the R package “SPIA” (version 2.18.0) [67]. Entrez IDs, log2 fold changes, and Q-values for all genes were compiled. The differential expression cut-off used in the SPIA algorithm was based on the FDR-adjusted Q-value. The analysis was run using the same tailored list of pathways as used in GSEA (S2 Table) and updated versions of these pathways were download prior to running the analysis (accessed 7/29/2015).


Results

Meta-analysis of differential expression

We assessed the global levels of gene expression across datasets by assigning rank values to each gene based on its mean expression value within a dataset. While we observe variation between studies, there still emerges a clear pattern of genes which are consistently strongly or weakly expressed in the brain (data not shown) supporting the feasibility of comparing studies to one another.

We used linear regression to evaluate gene expression changes with respect to four factors (age, sex, brain pH, and PMI), within each dataset. We considered both directions of change (up- and down-regulation) for each factor, creating up to eight different scenarios for each dataset. Although the focus of this paper is to report on the results from the meta-analysis across datasets, we briefly summarized here the results from individual studies. The numbers of genes that show evidence change with each of the factors (q < 0.01) is given in Table 2 . Not surprisingly, the datasets with smaller sample size showed fewer statistically significant changes associated with the factors. Overall the factors associated with the most robust differential expression were age and brain pH. Genes found to change in individual datasets but not from meta-analysis can be found in Supplementary Table 7 at http://chibi.ubc.ca/

Table 2

Significant genes (qπ.01) identified within each individual dataset

DatasetAgeSexpHPMI
DownUpFemaleMaleDownUpDownUp
Stanley Chen00270000
Stanley Kato00110703000
<"type":"entrez-geo","attrs":<"text":"GSE1572","term_id":"1572">> GSE15721626416n/an/a00
<"type":"entrez-geo","attrs":<"text":"GSE2164","term_id":"2164">> GSE216400120000
<"type":"entrez-geo","attrs":<"text":"GSE8919","term_id":"8919">> GSE8919153133619n/an/a2278
<"type":"entrez-geo","attrs":<"text":"GSE11512","term_id":"11512">> GSE11512003131000
<"type":"entrez-geo","attrs":<"text":"GSE11882","term_id":"11882">> GSE118824280214n/an/a10
<"type":"entrez-geo","attrs":<"text":"GSE13162","term_id":"13162">> GSE131620000n/an/a00
<"type":"entrez-geo","attrs":<"text":"GSE5390","term_id":"5390">> GSE539011n/an/an/an/a00
McLeanPFC001500n/an/a
<"type":"entrez-geo","attrs":<"text":"GSE3790","term_id":"3790">> GSE379000113n/an/an/an/a
‘Union’ Signature689198103370402278
Meta-signature Overlap4151026191910146

Q-values were calculated from regression p-values for each factor within each dataset. The number of genes reported here are significant at q < 0.01. The ‘union’ signature represents the union of unique genes identified for each factor across all datasets. The meta-signature overlap indicates the number of union signature genes overlapping with the corresponding meta-signature.

To examine changes in gene expression that were consistent across all datasets, or supported by evidence from multiple data sets, we implemented a cross-study meta-analysis approach (see Methods). The output of this analysis is eight meta-signatures (up or down for each of the four factors). The top ten genes from each meta-signature can be found in Table 7 . To examine the results, we first extracted the corresponding p-values from each individual dataset and visualized them as (smoothed) plots in the order determined by the meta-signature order ( Figure 1 ). We observe that genes that have good meta-q-values tended to have good p-values in multiple, but not necessarily all studies. We also demonstrate this using an alternative heat map visualization (Supplementary Figure 1). More detailed results are plotted for some example genes in Figure 2 . These plots show that p-values for a given gene vary across individual datasets, and the meta-analysis can identify genes which show only weak or non-significant effects in some data sets. For example, in Figure 2 , for age genes Gfap and Rgs4, we see weak changes in expression level (up and down, respectively), that are not significant after multiple test correction in those studies. On the other hand, we also find genes that show significant effects in most if not all studies (e.g., Xist, Figure 2 ). In Figure 3 we assembled the top 50 genes down-regulated with age, and plotted expression levels within each dataset with samples ordered by increasing age. For most of the studies, we observe a gradient across the dataset as expression decreases from high to low levels, illustrating that the meta-analysis recovers many genes which show fairly consistent trends across data sets.

For each dataset used, gene p-values were plotted against the corresponding meta-q value and a loess fit was computed to generate a smooth curve between points. The fact that most data sets show a rise in p-values correlated with the meta-q-values indicates the contribution of signals of varying strengths to the meta-signatures. An alternative view of the data using heat maps is available in the supplement. The distorted curves for gender are due to the strong effects of a small number of genes with very small meta-q-values (note the difference in scale of D compared to A𠄼).

For selected genes from each of the meta-signatures we have plotted the log regression p-values from each dataset. Open circles represent the datasets for which the gene was found to be significant after multiple test correction (q < 0.01). Dashed line indicates a per-study p-value significance level of 0.05 for reference.

The top 50 age down-regulated genes were selected based on meta-analysis q-value ranking. For each gene, the corresponding data from each study was extracted and converted to a heat map. Expression values were normalized across samples within each dataset, and ordered by age. Age is plotted at the top of each heat map. Light values in heat map indicate higher expression. Grey bars indicate missing values. All data sets are at approximately the same horizontal scale except the last, which is compressed to fit on the page.

Table 7

Age Down-regulatedAge Up-regulated
OLFM1olfactomedin 1NEBLnebulette
KCNF1potassium voltage-gated channel, subfamily F, member 1MED12mediator complex subunit 12
RGS4regulator of G-protein signaling 4BCL2B-cell CLL/lymphoma 2
PPP3CBprotein phosphatase 3 (formerly 2B), catalytic subunit, beta isoformGMPRguanosine monophosphate reductase
ADCY2adenylate cyclase 2 (brain)GFAPglial fibrillary acidic protein
SVOPSV2 related protein homolog (rat)ADH1Balcohol dehydrogenase 1B (class I), beta polypeptide
EFNB3ephrin-B3WWOXWW domain containing oxidoreductase
ATP2B2ATPase, Ca++ transporting, plasma membrane 2PLEC1plectin 1, intermediate filament binding protein 500kDa
HPCAhippocalcinVCANversican
CALB1calbindin 1, 28kDaAHCYL1S-adenosylhomocysteine hydrolase-like 1
pH Down-regulatedpH Up-regulated
FGF2fibroblast growth factor 2 (basic)SLC1A1solute carrier family 1 (neuronal/epithelial high affinity glutamate transporter, system Xag), member 1
AHCYL1S-adenosylhomocysteine hydrolase-like 1LARGElike-glycosyltransferase
DTNAdystrobrevin, alphaC17orf81chromosome 17 open reading frame 81
MAPKAPK2mitogen-activated protein kinase-activated protein kinase 2HISPPD2Ahistidine acid phosphatase domain containing 2A
TJP1tight junction protein 1 (zona occludens 1)PRKCDprotein kinase C, delta
S100A13S100 calcium binding protein A13DLG3discs, large homolog 3 (Drosophila)
RBBP6retinoblastoma binding protein 6KCNAB1potassium voltage-gated channel, shaker-related subfamily, beta member 1
GNG12guanine nucleotide binding protein (G protein), gamma 12SLC8A1solute carrier family 8 (sodium/calcium exchanger), member 1
ANP32Eacidic (leucine-rich) nuclear phosphoprotein 32 family, member EGABRA5gamma-aminobutyric acid (GABA) A receptor, alpha 5
BAALCbrain and acute leukemia, cytoplasmicRIT2Ras-like without CAAX 2
Female Up-regulatedMale Up-regulated
XISTX (inactive)-specific transcript (non-protein coding)JARID1Djumonji, AT rich interactive domain 1D
HDHD1Ahaloacid dehalogenase-like hydrolase domain containing 1AUSP9Yubiquitin specific peptidase 9, Y-linked (fat facets-like, Drosophila)
UTXubiquitously transcribed tetratricopeptide repeat, X chromosomeEIF1AYeukaryotic translation initiation factor 1A, Y-linked
JARID1Cjumonji, AT rich interactive domain 1CCYorf15Bchromosome Y open reading frame 15B
TSIXXIST antisense RNA (non-protein coding)DDX3YDEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
USP9Xubiquitin specific peptidase 9, X-linkedUTYubiquitously transcribed tetratricopeptide repeat gene, Y-linked
LOC554203alanyl-tRNA synthetase domain containing 1 pseudogeneRPS4Y1ribosomal protein S4, Y-linked 1
STSsteroid sulfatase (microsomal), isozyme STTTY15testis-specific transcript, Y-linked 15
ZFXzinc finger protein, X-linkedCYorf15Achromosome Y open reading frame 15A
PNPLA4patatin-like phospholipase domain containing 4TMSB4Ythymosin beta 4, Y-linked
PMI Down-regulatedPMI Up-regulated
BRD8bromodomain containing 8GOSR2golgi SNAP receptor complex member 2
RBM5RNA binding motif protein 5CYB5Bcytochrome b5 type B (outer mitochondrial membrane)
PUM2pumilio homolog 2 (Drosophila)SNF8SNF8, ESCRT-II complex subunit, homolog (S. cerevisiae)
ARHGEF7Rho guanine nucleotide exchange factor (GEF) 7GRLF1glucocorticoid receptor DNA binding factor 1
NCOA3nuclear receptor coactivator 3C6orf1chromosome 6 open reading frame 1
DAPK1death-associated protein kinase 1MGMTO-6-methylguanine-DNA methyltransferase
SORL1sortilin-related receptor, L(DLR class) A repeats-containingMAXMYC associated factor X
KIAA0841KIAA0841ST3GAL2ST3 beta-galactoside alpha-2,3-sialyltransferase 2
CAMK2Gcalcium/calmodulin-dependent protein kinase II gammaPTGER3prostaglandin E receptor 3 (subtype EP3)
DIP2CDIP2 disco-interacting protein 2 homolog C (Drosophila)EXT1exostoses (multiple) 1

For each meta-signature we have listed the top ten genes, as ranked by meta-q-value (all at q < 0.001). For each gene we have listed the gene symbol an gene name. Complete meta-signature lists for each factor can be found in Supplementary Table 6 http://chibi.ubc.ca/

While we have presented results from an analysis which treated each factor independently (linear regression), we also performed a meta-analysis which models the factors simultaneously in an analysis of covariance, yielding very similar results (available at http://chibi.ubc.ca/

mmistry/). We also attempted to model interactions among factors, but this was difficult due to lack of power.

Meta-profile evaluation

We tested the robustness of our meta-profiles by using a jackknife procedure. This involved sequentially removing a dataset, performing the meta-analysis on the remaining datasets, and then selecting genes at a meta-q < 0.01. This procedure was repeated for each data set in turn, and genes found in all rounds retained as a 𠇌ore” signature. Each of the core signatures encompass more than half of the genes found in the corresponding meta-profiles, with the pH meta-profiles as an exception. The 𠇜ore” signatures can be found in Supplementary Table 6 at http://chibi.ubc.ca/

The studies we selected for meta-analysis were, in general, not designed to test the effects of age, sex, pH or PMI in fact attempts may have been made to limit the range of these factors (especially in the case of PMI and pH). However, because human brain samples are difficult to obtain, enough variability is allowed into studies to allow us to perform a meaningful meta-analysis. We still questioned the extent to which our results would agree with more targeted studies. Therefore, for the purpose of validation of our approach, we identified independent gene lists from the literature for age, sex and brain pH (we could not find a comprehensive validation set for PMI). Details on validation sets can be found in Supplementary Table 2. Each validation gene list was then separated into two groups based on direction of change, to correspond with meta-signatures obtained from our meta-analysis. Obviously none of these validation gene lists can be considered true gold standards, but help put our results in the context of previous findings.

To quantify the predictive power of our analysis meta-signatures with respect to the corresponding validation sets, we first preformed a standard receiver operating characteristic (ROC) analysis. The score reported for each profile is the area under the ROC curve (AUC), a value between 0 and 1, where 1.0 is perfect agreement with the external list and 0.5 would reflect a random order. The AUC values for each of the profiles are reported in Table 3 . Individual ROC plots can be found in Supplementary Figure 2. We also tested the effect of using a specific statistical threshold for selecting genes from the meta-profiles, by collecting genes at two significance levels (meta-q π.01, and meta-q < 0.001). The overlap with the validation set was significant (pπ.001, Fisher s exact test Table 3 ) for all signatures. We also found a comparable overlap between each of the core signatures and the validation sets.

Table 3

Comparison of meta-signature profiles against validation gene sets

No. of profile genes (q < 0.001)Overlap with validation set (q < 0.001)No. of profile genes (q < 0.01)Overlap with validation set (q < 0.01)AUC score
Age Genes
Up-regulated404401241690.74
Down-regulated113411322471360.76
pH Genes
Up-regulated25113681310.91
Down-regulated2155510181220.86
Sex Genes
Male1473870.90
Female1911281n/a
PMI Genes
Up-regulated75n/a691n/an/a
Down- regulated4n/a49n/an/a

The brain pH profiles gave high AUC scores of 0.91 and 0.86 (up- and down-regulated genes, respectively), when validated against DLPFC pH-sensitive genes taken from Vawter et al. (2006). Additionally, we obtained reasonably high scores using a smaller independent pH gene list obtained from Mexal et al. (2006), despite a difference in the brain region used between the validation study and the meta-analysis (data not shown). Because pH itself probably covaries across brain regions (Mexal et al., 2006), our results are consistent with the hypothesis that pH-related changes in gene expression are similar across brain regions. The age profiles on the other hand, exhibited slightly lower scores than those obtained for brain pH. Erraji-Benchekroun et al. used samples from Brodmann areas 9 (dorsolateral PFC, BA9) and 47 (orbitofrontal PFC, BA47) from each subject, to evaluate age expression differences, showing comparable changes in both brain regions (Erraji-Benchekroun et al., 2005). As such, our validation set consisted of genes showing age expression changes collectively within both neocortical brain regions BA9 and BA47. While many of these genes appear at the top of our ranked lists, some are also dispersed throughout our ranking. The gender meta-signature from our meta-analysis scored high when validated with a set of genes from Galfalvy et al. (2003), with most validation genes sitting at the top of the ranking.

Overlap with schizophrenia candidate genes

We compared significant genes (meta-q < 0.001) from each of our meta-signatures with genes known to be associated with schizophrenia. We extracted a list of 34 schizophrenia candidate genes provided in a comprehensive literature review (Colantuoni et al., 2008), and searched this list of genes within each of our meta-signatures. We found that 12 of these genes identified with at least one of our meta-signatures, though the majority of overlap was observed amongst the age and pH meta-profiles ( Table 4 ). Overlap of genes with each of the age meta-signatures was significant at p < 0.01.

Table 4

Schizophrenia candidate gene analysis

Schizophrenia genes identified in meta-profiles
AgeDownOpcml, Pldn, Nrg1, Rgs4, Bdnf, Dlg4, Gad67
UpNtrk2, Ppp1r1b, Erbb3
pHDownNtrk2, Slc1a2
UpRgs4, Gad67
PMIUpNrg1

Affected Biological Pathways

To derive a high-level biological interpretation of our meta-signatures we performed a Gene Ontology (GO) (Ashburner et al., 2000) enrichment analysis using ErmineJ (Lee et al., 2005). We extracted the ‘top’ GO categories for each of the profiles and a comparison amongst them revealed various 𠆋iological processes’ unique to each meta-signature, in addition to a number of shared categories. In Figure 4 , we have displayed the top ten categories for each age meta-signature depicting the number of genes represented in each GO category and the associated p-value (corrected for multiple testing). The ORA data generated for all eight meta-signatures can be found in Supplementary Table 5 at http://chibi.ubc.ca/

For the age meta-signatures, we have displayed the top 10 GO terms identified using a GO over-representation analysis. The primary y-axis displays the number of meta-signature genes that fall in the given 𠆋iological process’ category, while the secondary axis displays the associated p-value. GO terms were collapsed to parent term if parent and child both appeared in the top ten.

For genes increasing in expression with the progression of age, top GO categories included those involved in cell growth and proliferation, and cell-cell interaction, consistent with previous studies (Erraji-Benchekroun et al., 2005, Hong et al., 2008). Other processes included the insulin receptor signaling pathway, encompassing a number of genes involved in longevity and aging (Bartke, 2008). The age down-regulated genes presented an enrichment in synaptic and/or receptor activity with GO categories such as neuron recognition, neurotransmitter transport and secretion, and regulation of neurotransmitter levels. This finding is concordant with existing aging studies in mouse and human (Erraji-Benchekroun et al., 2005, Oh et al., 2009). Similarly, we found an enrichment of genes involved in neurotransmitter secretion in the pH up-regulated meta-signature, in addition to genes implicated in metabolism and a different array of pathways (i.e. G-protein signaling). The female and male meta-signatures identified a similar enrichment of terms including relevant processes such as sex determination. However, the genes associated with these shared terms were quite different between the two meta-signatures. Functional enrichment analysis of our meta-signatures, although far from providing hard cellular evidence, still provided a useful indication of the biological processes altered by each factor and greater insight at the molecular level.

Correlation between factors and meta-profiles

Because we analyzed each factor independently, we wished to check whether factors were correlated with each other across the 415 samples ( Table 5 ). Age and PMI displayed the highest correlation of 0.35, consistent with a positive correlation reported in (Galfalvy et al., 2003). Age and brain pH displayed a slight negative correlation of 𢄠.2. Further investigation of these two factors revealed that categorizing the values of age into ‘young’ (< 50 years of age) and ‘old’ (≥ 50 years of age) groups resulted in a lower mean pH in the ‘old’ group versus the ‘young’. This was the general trend within each dataset (Supplementary Figure 3).

Table 5

Rank correlations between factors

AgeSexpHPMI
Age
Sex𢄠.12 **
pH𢄠.2 * 0.17
PMI0.35 *** 𢄠.2 *** 𢄠.06

Spearman rank correlations were computed using sample characteristic information for individual subjects.

Due to these correlations, we expected that individual genes in some meta-profiles might overlap with other meta-profiles ( Table 6 ). Accordingly, the two factors displaying the highest number of overlapping genes were those [up or down] in age and those [down or up] with brain pH, respectively. However, these effects were weak and were even weaker in the 𠆌ore’ signatures (Supplementary Table 4). We also found that a number of genes up-regulated with PMI were also identified amongst the profiles for brain pH and age in both directions, but were unable to extract any definite patterns.

Table 6

Evaluating gene overlap between meta-signatures

No. of Profile genesAgepHPMISex
UpDownUpDownUpDownFemaleMale
AgeUp404
Down11342
pHUp25018
Down2157510
PMIUp7523202
Down412000
SexFemale19410100
Male140100000

Using genes for each meta-profile at q < 0.001, we compared them against one another to evaluate the overlap and potential relationships between the factors. The observation of two genes in the age meta-signature that change in both directions is a consequence of multiple probes of different specificity for the same gene, which show different directions of change.


Background

The pace of data generation in the life sciences is steadily increasing. Primary data sets grow in depth and accuracy, covering more and more aspects of life. In molecular biology and biomedicine, these include large-scale measurements of DNA/Histone acetylation, transcriptional activity, gene expression and protein abundance (e.g. [1]). Measuring epigenetic patterns (DNA methylation, DNA/Histone acetylation) on a large scale has become possible only recently [1, 2]. Measuring transcription is entering a new era with the introduction of deep (or next-generation, RNA-seq) sequencing [3, 4]. Proteomics is becoming possible at unprecedented depth, covering ever-larger parts of the proteome on a routine basis [5]. For these primary data, repositories such as the Gene Expression Omnibus database (GEO [6]) or ArrayExpress [7] are constantly expanding.

Often, measurements are differential: they are made for two or more conditions (such as gene knockdown or knockout [8]), for two or more time points (such as time series tracking the consequences of some experimental intervention, [9]), or for two or more species (such as mouse and human, [10]). Exploiting differential measurements is one key to cope with the flood of data, by focusing on the most pronounced differences.

Life scientists also have to handle a deluge of secondary data, in the form of papers, reviews and curated databases. These may be integrated by automated systems such as STRING [11], or by manual efforts [12–14]. Exploiting secondary data provides another key to cope with the flood of primary data, by putting them into context and focusing on the most pronounced confirmations and contradictions to what is known already.

In this paper, we propose to interpret differential data in the context of knowledge, yielding the 'essence' of an experiment. Differential data may be provided by two microarrays, and knowledge may be provided by a network describing gene/protein interaction and regulation. In this case, data tracking gene expression in the course of an experiment can be used to identify the most pronounced putative mechanisms. They are identified as those known links between genes/proteins along which expression changes indicate that there may have been some regulatory change, such as the startup or shutdown of an interaction, a stimulation or an inhibition. ExprEssence highlights these links, and it enables the user to filter out all links with no or negligible change. The higher the filter threshold on the amount of change to be displayed, the fewer links are shown, making it straightforward to examine the 'essence' of the experiment. Network condensations are illustrated by pairs of figures (original network - condensed network) in the section on Case Studies. The condensed network contains good candidates for interpreting the experiment in mechanistic terms, giving rise to the design of new experiments. However, all inferences are hypotheses derived from correlations in the experimental data in the context of the a priori knowledge encoded in the network, and it must be kept in mind that correlative data do not necessarily entail mechanistic causality. Moreover, the validity of the hypotheses generated by our method will depend on the coverage and correctness of the network, and on the accuracy of the experimental data.

Related Work

Starting with the pioneering work of Ideker et al. [15], there is a plethora of methods that combine network data with high-throughput data (such as microarrays), in order to highlight pathways or subnetworks, see the excellent recent reviews of Minguez & Dopazo [16], Wu et al. [17] and Yu & Li [18]. Notably, few of these methods are readily available as publicly accessible software packages, plugins or web services (see Table and in [17]). Also, there does not seem to be a gold standard that can be used for validation purposes (see, e.g., Tarca et al. [19] for a recent discussion). Some methods lack validation except for the example for which they were developed for, while others are studied for an array of specific examples. In these cases, strong enrichment in plausible Gene Ontology categories or detection of known pathways or annotations is often used to demonstrate utility, as in [19–25]. We found two articles including a comparison of different subnetwork identification methods. The first one by Parkkinen and Kaski [26] introduces variants of the Interaction Component Model (ICM) method, comparing them to the original ICM method, to a method based on hidden modular random fields (HMoF) [27] and to Matisse [28], using identification of Gene Ontology classes and coverage of protein complexes for two selected data sets (osmotic shock response and DNA damage data) to judge one method over the other. An evaluation of ClustEx [29], jActiveModules [15], GXNA [21] and a simple approach based on fold change can be found in [29], taking identification of gene sets, pathways and microarray targets known from the literature and from the Gene Ontology for comparison.

In general, it is exceedingly difficult to validate the detection of (sub-) networks or (sub-) pathways: these are complex entities, and ultimate experimental validation is impossible because of this complexity: experimentalists are usually limited to investigating only few components in isolation at any given time. Nevertheless, we will compare results of our method with results obtained by jActiveModules, in a separate section following the case studies. In contrast, by just highlighting single links in networks, we tackle a more primitive task, but in this case results can be validated directly by experiment, or by identifying corroborative statements in the literature. In particular, as can be seen from our case studies, the single links that we highlight give rise to predictions about single genes and about single one-step mechanisms that can be investigated in isolation. Therefore, we would like to emphasize the direct utility of our focus on single links and genes, complementing the (sub-)network centric view that is usually employed to the best of our knowledge, the 'single link and gene' focus is not employed by other methods combining network and high-throughput ('omics') data. In fact, we propose a 'winning combination' of 'network'/'omics' and 'classical' biology, using networks and high-throughput data to highlight single genes and links that may then be validated directly by classical molecular biology, as will be demonstrated in our case studies.

As future work, our formula for link highlighting can, however, be integrated into current methods for pathway/subnetwork detection, possibly improving these considerably. In particular, no such method treats inhibitions and stimulations in a distinct way, as we do. In particular, we envision that the edge score formula of Guo et al. [20], which is based on measuring co-variance, may be replaced by our formula (see below), emphasizing a different aspect of differential gene expression: While Guo et al. identify coordinated changes using their formula, integration of our formula into their framework would identify subnetworks with changes that are consistent with an input network of interactions, stimulations and inhibitions. In any case, we wish to stress that for the identification of coordinated changes, correlation coefficients are most suitable. Our approach, however, identifies a different biological message, namely startups/shutdowns of interactions, stimulations and inhibitions, using an input network that is informative about biological relationships such as stimulations and inhibitions.


RESULTS

MACAU overview

Both of the random effects terms |$oldsymbol$| and |$oldsymbol$| model over-dispersion, the extra variance not explained by a Poisson model. However, the two terms |$oldsymbol$| and |$oldsymbol$| model two different aspects of over-dispersion. Specifically, |$oldsymbol$| models the fraction of the extra variance that is explained by sample non-independence while |$oldsymbol$| models the fraction of the extra variance that is independent across samples. For example, let us consider a simple case in which all samples have the same sequencing depth (i.e. |$ = N$|⁠ ) and there is only one intercept term |$mu$| included as the covariate. In this case, the random effects term |$oldsymbol$| models the independent over-dispersion: without |$oldsymbol$|⁠ , |$V ( y ) = E( y )( <1 + E( y )( <>> - 1> )> )$| is still larger than the mean |$E( y ) = N/2>>$|⁠ , with the difference between the two increasing with increasing |$$|⁠ . In a similar fashion, the random effects term |$oldsymbol$| models the non-independent over-dispersion by accounting for the sample covariance matrix |$oldsymbol$|⁠ . By modeling both aspects of over-dispersion, our PMM effectively generalizes the commonly used negative binomial model—which only models independent extra variance—to account for sample non-independence. In addition, our PMM naturally extends the commonly used LMM ( 9, 64, 83, 84) to modeling count data.

Our goal here is to test the null hypothesis that gene expression levels are not associated with the predictor variable of interest, or |$:eta = < m< >>0$|⁠ . Testing this hypothesis requires estimating parameters in the PMM (as has previously been done in other settings ( 85, 86), including for modeling uneven RNAseq read distribution inside transcripts ( 13) details in Supplementary Data ). The PMM belongs to the generalized LMM family, where parameter estimation is notoriously difficult because of the random effects and the resulting intractable n-dimensional integral in the likelihood. Standard estimation methods rely on numerical integration ( 87) or Laplace approximation ( 88, 89), but neither strategy scales well with the increasing dimension of the integral, which in our case equals the sample size. As a consequence, standard approaches often produce biased estimates and overly narrow (i.e. anti-conservative) confidence intervals ( 90– 96). To overcome the high-dimensionality of the integral, we instead develop a novel Markov Chain Monte Carlo (MCMC) algorithm, which, with enough iterations, can achieve high inference accuracy ( 97, 98). We use MCMC to draw posterior samples but rely on the asymptotic normality of both the likelihood and the posterior distributions ( 99) to obtain the approximate maximum likelihood estimate |$_j>$| and its standard error se( ⁠|$widehat <<>_j>>$|⁠ ). With |$_j>$| and se( ⁠|$widehat <<>_j>>$|⁠ ), we can construct approximate Wald test statistics and P-values for hypothesis testing ( Supplementary Material ). Although we use MCMC, our procedure is frequentist in nature.

At the technical level, our MCMC algorithm is also novel, taking advantage of an auxiliary variable representation of the Poisson likelihood ( 61– 63) and recent linear algebra innovations for fitting LMMs ( 9, 58, 64). Our MCMC algorithm introduces two continuous latent variables for each individual to replace the count observation, effectively extending our previous approach of using one latent variable for the simpler binomial distribution ( 59). Compared with a standard MCMC, our new MCMC algorithm reduces the computational complexity of each MCMC iteration from cubic to quadratic with respect to the sample size. Therefore, our method is orders of magnitude faster than the popular Bayesian software MCMCglmm ( 100) and can be used to analyze hundreds of samples and tens of thousands of genes with a single desktop PC ( Supplementary Figure S1 ). Although our procedure is stochastic in nature, we find the MCMC errors are often small enough to ensure stable P-values across independent MCMC runs ( Supplementary Figure S2 ). We summarize the key features of our method along with other commonly used approaches in Table 1.

Simulations: control for sample non-independence

We performed a series of simulations to compare the performance of the PMM implemented in MACAU with four other commonly used methods: (i) a linear model (ii) the LMM implemented in GEMMA ( 9, 58) (iii) a Poisson model and (iv) a negative binomial model. We used quantile-transformed data for linear model and GEMMA (see ‘Materials and Methods’ section for normalization details and a comparison between various transformations Supplementary Figure S3 ) and used raw count data for the other three methods. To make our simulations realistic, we use parameters inferred from a published RNAseq dataset on a population of wild baboons ( 7, 71) to perform simulations (‘Materials and Methods’ section) this baboon dataset contains known related individuals and hence invokes the problem of sample non-independence outlined above.

Our first set of simulations was performed to evaluate the effectiveness of MACAU and the other four methods in controlling for sample non-independence. To do so, we simulated expression levels for 10 000 genes in 63 individuals (the sample size from the baboon dataset). Simulated gene expression levels are influenced by both independent environmental effects and correlated genetic effects, where genetic effects are simulated based on the baboon kinship matrix (estimated from microsatellite data ( 7)) with either zero ( |$ = 0.0$|⁠ ), moderate ( |$ = 0.3$|⁠ ), or high ( ⁠|$ = 0.6$|⁠ ) heritability values. We also simulated a continuous predictor variable x that is itself moderately heritable ( |$h_x^2 = 0.4$|⁠ ). Because we were interested in the behavior of the null in this set of simulations, gene expression levels were not affected by the predictor variable (i.e. no genes were truly DE).

Figure 1, Supplementary Figures S4 and 5 show quantile–quantile plots for analyses using MACAU and the other four methods against the null (uniform) expectation, for |$ = 0.6$|⁠ , |$ = 0.3$| and |$ = 0.0$| respectively. When genes are heritable and the predictor variable is also correlated with individual relatedness, then the resulting P-values from the DE analysis are expected to be uniform only for a method that properly controls for sample non-independence. If a method fails to control for sample non-independence, then the P-values would be inflated, resulting in false positives.

QQ-plots comparing expected and observed P-value distributions generated by different methods for the null simulations in the presence of sample non-independence. In each case, 10 000 non-DE genes were simulated with n = 63, CV = 0.3, σ 2 = 0.25, h 2 = 0.6, and hx 2 = 0.4. Methods for comparison include MACAU (A), Negative binomial (B), Poisson (C), GEMMA (D), and Linear (E). Both MACAU and GEMMA properly control for type I error well in the presence of sample non-independence. λgc is the genomic control factor.

QQ-plots comparing expected and observed P-value distributions generated by different methods for the null simulations in the presence of sample non-independence. In each case, 10 000 non-DE genes were simulated with n = 63, CV = 0.3, σ 2 = 0.25, h 2 = 0.6, and hx 2 = 0.4. Methods for comparison include MACAU (A), Negative binomial (B), Poisson (C), GEMMA (D), and Linear (E). Both MACAU and GEMMA properly control for type I error well in the presence of sample non-independence. λgc is the genomic control factor.

Our results show that, because MACAU controls for sample non-independence, the P-values from MACAU follow the expected uniform distribution closely (and are slightly conservative) regardless of whether gene expression is moderately or highly heritable. The genomic control factors from MACAU are close to 1 (Figure 1 and Supplementary Figure S4 ). Even if we use a relatively relaxed q-value cutoff of 0.2 to identify DE genes, we do not incorrectly identify any genes as DE with MACAU. In contrast, the P-values from negative binomial are inflated and skewed toward low (significant) values, especially for gene expression levels with high heritability. With negative binomial, 27 DE genes (when h 2 = 0.3) or 21 DE genes (when h 2 = 0.6) are erroneously detected at the q-value cutoff of 0.2. The inflation of P-values is even more acute in Poisson, presumably because the Poisson model accounts for neither individual relatedness nor over-dispersion. For non-count-based models, the P-values from a linear model are slightly skewed towards significant values, with three DE genes (when h 2 = 0.3) and one DE gene (when h 2 = 0.6) erroneously detected at q < 0.2. In contrast, because the LMM in GEMMA also accounts for individual relatedness, it controls for sample non-independence well. Finally, when genes are not heritable, all methods except Poisson correctly control type I error ( Supplementary Figure S5 ).

Two important factors influence the severity of sample non-independence in RNAseq data (Figure 2). First, the inflation of P-values in the negative binomial, Poisson and linear models becomes more acute with increasing sample size. In particular, when |$h_x^2 = 0.4$|⁠ , with a sample size of |$n = 1,000$|⁠ , |$>$| from the negative binomial, Poisson and linear models reaches 1.71, 82.28 and 1.41, respectively. In contrast, even when |$n = 1,000$|⁠ , |$>$| from both MACAU and GEMMA remain close to 1 (0.97 and 1.01, respectively). Second, the inflation of P-values in the three models also becomes more acute when the predictor variable is more correlated with population structure. Thus, for a highly heritable predictor variable ( ⁠|$h_x^2 = 0.8$|⁠ ), |$>$| (when |$n = 1000$|⁠ ) from the negative binomial, Poisson and linear models increases to 2.13, 101.43 and 1.81, respectively, whereas |$>$| from MACAU and GEMMA remains close to 1 (1.02 and 1.05).

Comparison of the genomic control factor λgc from different methods for the null simulations in the presence of sample non-independence. 10 000 null genes were simulated with CV = 0.3, σ 2 = 0.25, h 2 = 0.6, and (A) hx 2 = 0 (B) hx 2 = 0.4 (C) hx 2 = 0.8. λgc (y-axis) changes with sample size n (x-axis). Methods for comparison were MACAU (red), Negative binomial (purple), GEMMA (blue), and Linear (cyan). Both MACAU and GEMMA provide calibrated test statistics in the presence of sample non-independence across a range of settings. λgc from Poisson exceeds 10 in all settings and is thus not shown.

Comparison of the genomic control factor λgc from different methods for the null simulations in the presence of sample non-independence. 10 000 null genes were simulated with CV = 0.3, σ 2 = 0.25, h 2 = 0.6, and (A) hx 2 = 0 (B) hx 2 = 0.4 (C) hx 2 = 0.8. λgc (y-axis) changes with sample size n (x-axis). Methods for comparison were MACAU (red), Negative binomial (purple), GEMMA (blue), and Linear (cyan). Both MACAU and GEMMA provide calibrated test statistics in the presence of sample non-independence across a range of settings. λgc from Poisson exceeds 10 in all settings and is thus not shown.

We also compared MACAU with edgeR ( 25) and DESeq2 ( 15), two commonly used methods for DE analysis ( 11, 101). Because edgeR and DESeq2 were designed for discrete predictor valuables, we discretized the continuous predictor |$x$| into 0/1 based on the median predictor value across individuals. We then applied all methods to the same binarized predictor values for comparison. Results are shown in Supplementary Figure S6 . For the five methods compared above, the results on binarized values are comparable with those for continuous variables (i.e. Supplementary Figure S6 versus Figure 1). Both edgeR and DESeq2 produce anticonservative P-values and perform similarly to the negative binomial model in terms of type I error control.

Finally, we explored the use of PCs from the gene expression matrix or the genotype matrix to control for sample non-independence. Genotype PCs have been used as covariates to control for population stratification in association studies ( 102). However, recent comparative studies have shown that using PCs is less effective than using LMMs ( 83, 103). Consistent with the poorer performance of PCs in association studies ( 83, 103), using the top PCs from either the gene expression matrix or the genotype matrix does not improve type I error control for negative binomial, Poisson, linear, edgeR or DESeq2 approaches ( Supplementary Figures S7 and 8 ).

Simulations: power to identify DE genes

Our second set of simulations was designed to compare the power of different methods for identifying DE genes, again based on parameters inferred from real data. This time, we simulated a total of 10 000 genes, among which 1000 genes were truly DE and 9000 were non-DE. For the DE genes, simulated effect sizes corresponded to a fixed PVE in gene expression levels that ranged from 15 to 35%. For each set of parameters, we performed 10 replicate simulations and measured model performance based on the AUC (as in ( 35, 68, 104)). We also examined several key factors that could influence the relative performance of the alternative methods: (i) gene expression heritability ( ⁠|$$|⁠ ) (ii) correlation between the predictor variable |$x$| and genetic relatedness (measured by the heritability of |$x$|⁠ , or |$h_x^2$|⁠ ) (iii) variation of the TRCs across samples (measured by the CV) (iv) the over-dispersion parameter ( ⁠|$$|⁠ ) (v) the effect size (PVE) and (vi) sample size (n). To do so, we first performed simulations using a default set of values ( ⁠|$ =$| 0.3, |$h_x^2$| = 0, CV = 0.3, |$< m< >> =$| 0.25, PVE = 0.25 and n = 63) and then varied them one at a time to examine the influence of each factor on the relative performance of each method.

Our results show that MACAU works either as well as or better than other methods in almost all settings (Figure 3 and Supplementary Figure S9–14 ), probably because it both models count data directly and controls for sample non-independence. In contrast, the Poisson approach consistently fared the worst across all simulation scenarios, presumably because it fails to account for any sources of over-dispersion (Figure 3 and Supplementary Figures S9–14 ).

MACAU exhibits increased power to detect true positive DE genes across a range of simulation settings. Area under the curve (AUC) is shown as a measure of performance for MACAU (red), Negative binomial (purple), Poisson (green), GEMMA (blue), and Linear (cyan). Each simulation setting consists of 10 simulation replicates, and each replicate includes 10 000 simulated genes, with 1 000 DE and 9 000 non-DE. We used n = 63, hx 2 = 0.4, PVE = 0.25, and σ 2 = 0.25. In (A) we increased h 2 while maintaining CV = 0.3 and in (B) we increased CV while maintaining h 2 = 0.3. Boxplots of AUC across replicates for different methods show that (A) heritability (h 2 ) influences the relative performance of the methods that account for sample non-independence (MACAU and GEMMA) compared to the methods that do not (negative binomial, Poisson, linear) (B) variation in total read counts across individuals, measured by the coefficient of variation (CV), influences the relative performance of GEMMA and negative binomial. Insets in the two figures show the rank of different methods, where the top row represents the highest rank.

MACAU exhibits increased power to detect true positive DE genes across a range of simulation settings. Area under the curve (AUC) is shown as a measure of performance for MACAU (red), Negative binomial (purple), Poisson (green), GEMMA (blue), and Linear (cyan). Each simulation setting consists of 10 simulation replicates, and each replicate includes 10 000 simulated genes, with 1 000 DE and 9 000 non-DE. We used n = 63, hx 2 = 0.4, PVE = 0.25, and σ 2 = 0.25. In (A) we increased h 2 while maintaining CV = 0.3 and in (B) we increased CV while maintaining h 2 = 0.3. Boxplots of AUC across replicates for different methods show that (A) heritability (h 2 ) influences the relative performance of the methods that account for sample non-independence (MACAU and GEMMA) compared to the methods that do not (negative binomial, Poisson, linear) (B) variation in total read counts across individuals, measured by the coefficient of variation (CV), influences the relative performance of GEMMA and negative binomial. Insets in the two figures show the rank of different methods, where the top row represents the highest rank.

Among the factors that influence the relative rank of various methods, the most important factor was heritability |$< m< >>( <> )$| (Figure 3A). While all methods perform worse with increasing gene expression heritability, heritability disproportionately affects the performance of models that do not account for relatedness (i.e. negative binomial, Poisson and Linear), whereas when heritability is zero ( |$ = 0$|⁠ ), these approaches tend to perform slightly better. Therefore, for non-heritable genes, linear models perform slightly better than GEMMA, and negative binomial models work similarly or slightly better than MACAU. This observation most likely arises because linear and negative binomial models require fewer parameters and thus have a greater number of degrees of freedom. However, even in this setting, the difference between MACAU and negative binomial is small, suggesting that MACAU is robust to model misspecification and works reasonably well even for non-heritable genes. On the other hand, when heritability is moderate ( |$ = 0.3$|⁠ ) or high ( |$ = 0.6$|⁠ ), the methods that account for sample non-independence are much more powerful than the methods that do not. Because almost all genes are influenced by cis-eQTLs ( 46, 47) and are thus likely heritable to some extent, MACAU's robustness for non-heritable genes and its high performance gain for heritable genes make it appealing.

The second most important factor in relative model performance was the variation of TRCs across individuals (CV Figure 3B). While all methods perform worse with increasing CV, CV particularly affects the performance of GEMMA. Specifically, when CV is small (0.3 as the baboon data), GEMMA works well and is the second best method behind MACAU. However, when CV is moderate (0.6) or high (0.9), the performance of GEMMA quickly decays: it becomes only the fourth best method when CV = 0.9. GEMMA performs poorly in high CV settings presumably because the LMM fails to account for the mean-variance dependence observed in count data, which is in agreement with previous findings ( 59, 105).

The other four factors we explored had small impacts on the relative performance of the alternative methods, although they did affect their absolute performance. For example, as one would expect, power increases with large effect sizes (PVE) ( Supplementary Figure S9 ) or large sample sizes ( Supplementary Figure S10 ), and decreases with large over-dispersion |$$| ( Supplementary Figure S11 ) or large |$h_x^2$| ( Supplementary Figure S12 ).

Finally, we included comparisons with edgeR ( 25) and DESeq2 ( 15). In the basic parameter simulation setting (n = 63, CV = 0.3, |$h_x^2$| = 0, PVE = 0.25, |$ =$| 0.3 and |$< m< >> =$| 0.25), we again discretized the continuous predictor |$x$| into a binary 0/1 variable based on the median predictor value across individuals. Results for all methods are shown in Supplementary Figure S13A . For the five methods also tested on a continuous predictor variable, the power on binarized values is much reduced compared with the power when the predictor variable is modeled without binarization (e.g. Supplementary Figure S13A versus Figure 3). Further, neither edgeR nor DESeq2 perform well, consistent with the recent move from these methods towards linear models in differential expression analysis ( 3, 7, 45– 47, 106). This result is not contingent on having large sample sizes. In small sample size settings (n = 6, n = 10 and n = 14, with samples balanced between the two classes, 0 or 1), MACAU again outperforms the other methods, though the power difference is much smaller (n = 10 and n = 14 Supplementary Figures S13C and 13D ) and sometimes negligible (n = 6, Supplementary Figure S13B ).

In summary, the power of MACAU and other methods, as well as the power difference between methods, is influenced in a continuous fashion by multiple factors. Larger sample sizes, larger effect sizes, lower read depth variation, lower gene expression heritability, lower predictor variable heritability and lower over-dispersion all increase power. However, MACAU's power is less diminished by high gene expression heritability and high read depth variability than the non-mixed model methods, while retaining the advantage of modeling the count data directly. In challenging data analysis settings (e.g. when sample size is low and effect size is low: Supplementary Figure S13B for n = 6), no method stands out and using MACAU results in no or negligible gains in power relative to other methods. When the sample size is low (n = 6) and effect sizes are large, however, MACAU consistently outperforms the other methods (n = 6, Supplementary Figure S14 ).

Real data applications

To gain insight beyond simulation, we applied MACAU and the other six methods to three recently published RNAseq datasets.

The first dataset we considered is the baboon RNAseq study ( 7) used to parameterize the simulations above. Expression measurements on 12 018 blood-expressed genes were collected by the (ABRP) ( 71) for 63 adult baboons (26 females and 37 males), among which some were relatives. Here, we applied MACAU and the six other methods to identify genes with sex-biased expression patterns. Sex-associated genes are known to be enriched on sex chromosomes ( 107, 108), and we use this enrichment as one of the criteria to compare method performance, as in ( 18). Because the same nominal P-value from different methods may correspond to different type I errors, we compared methods based on empirical FDR. In particular, we permuted the data to construct an empirical null, estimated the FDR at any given P-value threshold, and counted the number of discoveries at a given FDR cutoff (see ‘Materials and Methods’ section for permutation details and a comparison between two different permutation procedures Supplementary Figure S15 ).

In agreement with our simulations, MACAU was the most powerful method of those we considered. Specifically, at an empirical FDR of 5%, MACAU identified 105 genes with sex-biased expression patterns, 40% more than that identified by the linear model, the second best method at this FDR cutoff (Figure 4A). At a more relaxed FDR of 10%, MACAU identified 234 sex-associated genes, 47% more than that identified by the negative binomial model, the second best method at this FDR cutoff (Figure 4A). Further, as expected, the sex-associated genes detected by MACAU are enriched on the X chromosome (the Y chromosome is not assembled in baboons and is thus ignored), and this enrichment is stronger for the genes identified by MACAU than by the other methods (Figure 4B). Of the remaining approaches, the negative binomial, linear model and GEMMA all performed similarly and are ranked right after MACAU. The Poisson model performs the worst, and edgeR and DESeq2 fall between the Poisson model and the other methods (Figure 4A and B).

MACAU identifies more differentially expressed genes than other methods in the baboon (A-B) and FUSION (C-F) data sets. Methods for comparison include MACAU (red), Negative binomial (purple), Poisson (green), edgeR (magenta), DESeq2 (rosybrown), GEMMA (blue), and Linear (cyan). (A) shows the number of sex-associated genes identified by different methods at a range of empirical false discovery rates (FDRs). (B) shows the number of genes that are on the X chromosome out of the genes that have the strongest sex association for each method (note that the Y chromosome is not assembled in baboons and is thus ignored). For instance, in the top 400 genes identified by MACAU, 41 of them are also on the X chromosome. (C) shows the number of T2D-associated genes identified by different methods at a range of empirical false discovery rates (FDRs). (D) shows the number of genes that are in the list of top 1 000 genes most significantly associated with GL out of the genes that have the strongest association for T2D for each method. For instance, in the top 1 000 genes with the strongest T2D association identified by MACAU, 428 of them are also in the list of top 1 000 genes with the strongest GL association identified by the same method. (E) shows the number of GL-associated genes identified by different methods at a range of FDRs. (F) shows the number of genes that are in the list of top 1 000 genes most significantly associated with T2D out of the genes that have the strongest association for GL for each method. T2D: type II diabetes GL: fasting glucose level.

MACAU identifies more differentially expressed genes than other methods in the baboon (A-B) and FUSION (C-F) data sets. Methods for comparison include MACAU (red), Negative binomial (purple), Poisson (green), edgeR (magenta), DESeq2 (rosybrown), GEMMA (blue), and Linear (cyan). (A) shows the number of sex-associated genes identified by different methods at a range of empirical false discovery rates (FDRs). (B) shows the number of genes that are on the X chromosome out of the genes that have the strongest sex association for each method (note that the Y chromosome is not assembled in baboons and is thus ignored). For instance, in the top 400 genes identified by MACAU, 41 of them are also on the X chromosome. (C) shows the number of T2D-associated genes identified by different methods at a range of empirical false discovery rates (FDRs). (D) shows the number of genes that are in the list of top 1 000 genes most significantly associated with GL out of the genes that have the strongest association for T2D for each method. For instance, in the top 1 000 genes with the strongest T2D association identified by MACAU, 428 of them are also in the list of top 1 000 genes with the strongest GL association identified by the same method. (E) shows the number of GL-associated genes identified by different methods at a range of FDRs. (F) shows the number of genes that are in the list of top 1 000 genes most significantly associated with T2D out of the genes that have the strongest association for GL for each method. T2D: type II diabetes GL: fasting glucose level.

The second dataset we considered is an RNAseq study on T2D collected as part of the FUSION study ( 60). Here, the data were collected from skeletal muscle samples from 267 individuals with expression measurements on 21 753 genes. Individuals are from three municipalities (Helsinki, Savitaipale and Kuopio) in Finland. Individuals within each municipality are more closely related than individuals between municipalities (e.g. the top genotype PCs generally correspond to the three municipalities Supplementary Figure S16 ). Two related phenotypes were available to us: 162 individuals with T2D or NGT status (i.e. case/control) based on the OGTT and 267 individuals with the quantitative trait fasting GL, a biologically relevant trait of T2D.

We performed analyses to identify genes associated with T2D status as well as genes associated with GL. To accommodate edgeR and DESeq2, we also discretized the continuous GL values into binary 0/1 categories based on the median GL value across individuals. We refer to the resulting values as GL01. Therefore, we performed two sets of analyses for GL: one on the continuous GL values and the other on the discretized GL01 values. Consistent with simulations and the baboon data analysis, MACAU identified more T2D-associated genes and GL-associated genes than other methods across a range of empirical FDR values. For the T2D analysis, MACAU identified 23 T2D-associated genes at an FDR of 5%, while GEMMA and the linear model, the second best methods at this FDR cutoff, identified only 1 T2D-associated gene (Figure 4C). Similarly, at an FDR of 10%, MACAU identified 123 T2D-associated genes, 51% more than that identified by the linear model, the second best method at this FDR cutoff (Figure 4C). For GL analysis, based on an FDR of 5%, MACAU detected 12 DE genes, while the other methods did not identify any DE genes at this FDR cutoff. At an FDR of 10%, MACAU identified 100 GL associated genes, while the second best methods—the linear model and GEMMA—identified 12 DE genes (Figure 4E). For the dichotomized GL01, none of the methods detected any DE genes even at a relaxed FDR cutoff of 20%, highlighting the importance of modeling the original continuous predictor variable in DE analysis.

Several lines of evidence support the biological validity of the genes detected by MACAU. First, we performed gene ontology (GO) analysis using LRpath ( 109) on T2D and GL associated genes identified by MACAU, as in the FUSION study ( 60) ( Supplementary Figure S17 ). The GO analysis results for T2D and GL are consistent with previous studies ( 60, 110) and are also similar to each other, as expected given the biological relationship between the two traits. In particular, T2D status and high GL are associated with decreased expression of cellular respiratory pathway genes, consistent with previous observations ( 60, 110). T2D status and GL are also associated with several pathways that are related to mTOR, including generation of precursor metabolites, poly-ubiquitination and vesicle trafficking, in agreement with a prominent role of mTOR pathway in T2D etiology ( 111– 114).

Second, we performed overlap analyses between T2D and GL associated genes. We reasoned that T2D-associated genes are likely associated with GL because T2D shares a common genetic basis with GL ( 115– 117) and T2D status is determined in part by fasting GL. Therefore, we used the overlap between genes associated with T2D and genes associated with GL as a measure of method performance. In the overlap analysis, genes with the strongest T2D association identified by MACAU show a larger overlap with the top 1000 genes that have the strongest GL association than did genes identified by other methods (Figure 4D). For instance, among the top 100 genes with the strongest T2D-association evidence from MACAU, 63 of them also show strong association evidence with GL. In contrast, only 55 of the top 100 genes with the strongest T2D-association identified by GEMMA, the second best method, show strong association evidence with GL. We observed similar results, with MACAU performing the best, when performing the reciprocal analysis (overlap between genes with the strongest GL-association and the top 1000 genes that have the strongest T2D-association: Figure 4F). To include the comparison with edgeR and DESeq2, we further examined the overlap between T2D associated genes and GL01 associated genes for all methods ( Supplementary Figure S18 ). Again, MACAU performs the best, followed by GEMMA and the linear model, and neither edgeR nor DESeq2 perform well in this context ( Supplementary Figure S18 ). Therefore, MACAU appears to both confer more power to identify biologically relevant DE genes and be more consistent across analyses of related phenotypes.

To assess the type I error rate of various methods, we permuted the trait data from the baboon and the FUSION studies. Consistent with our simulation results, the P-values from MACAU and GEMMA under the permuted null were close to uniformly distributed (slightly conservative) in both datasets, whereas the other methods were not ( Supplementary Figures S19 and 20 ). In addition, none of the methods compared here are sensitive to outliers in the two datasets ( Supplementary Figures S21–23 ).

Finally, although large, population-based RNAseq datasets are becoming more common, MACAU's flexible PMM modeling framework allows it to be applied to DE analysis in small datasets with unrelated individuals as well. In this setting, MACAU can use the gene expression covariance matrix as the |$oldsymbol< K>$| matrix to control for hidden confounding effects that are commonly observed in sequencing studies ( 48– 51). Hidden confounders can induce similarity in gene expression levels across many genes even though individuals are unrelated ( 52– 56), similar to the effects of kinship or population structure. Therefore, by defining |$oldsymbol< K>$| using a gene expression (instead of genetic) covariance matrix, MACAU can effectively control for sample non-independence induced by hidden confounders, thus extending the LMM widely used to control for hidden confounders in array based studies ( 52– 56) to sequencing count data.

To illustrate this application, we analyzed a third dataset on LCLs derived from 69 unrelated Nigerian individuals (YRI) ( 3) from the HapMap project ( 118), with expression measurements on 13 319 genes. We also aimed to identify sex-associated genes in this dataset. To demonstrate the effectiveness of MACAU in small samples, we randomly subsampled individuals from the data to create small datasets with either n = 6 (3 males and 3 females), n = 10 (5 males and 5 females) or n = 14 individuals (7 males and 7 females). For each sample size n, we performed 20 replicates of random subsampling and then evaluated method performance by averaging across replicates. In each replicate, we used the gene expression covariance matrix as |$oldsymbol$| and compared MACAU's performance against other methods. Because of the small sample size, none of the methods were able to identify DE genes at an FDR cutoff of 10%, consistent with recent arguments that at least 6–12 biological replicates are needed to ensure sufficient power and replicability in DE analysis ( 11). We therefore used enrichment of genes on the sex chromosomes to compare the performance of different methods ( Supplementary Figure S24 ). The enrichment of top ranked sex-associated genes on sex chromosomes has previously been used for method comparison and is especially suitable for comparing methods in the presence of batch effects and other hidden confounding factors ( 119).

In this comparison, MACAU performs the best of all methods when the sample size is either n = 10 or n = 14, and is ranked among the best (together with the negative binomial model) when n = 6. For instance, when n = 6, among the top 50 genes identified by each method, the number of genes on the sex chromosomes for MACAU, negative binomial, Poisson, edgeR, DESeq2, GEMMA and Linear are 3.3, 2.7, 3.1, 1.8, 3.0, 2.0 and 2.4, respectively. The advantage of MACAU becomes larger when the sample size increases: for example, when n = 14, an average of 10.6 genes in the top 50 genes from MACAU are on the sex chromosomes, which is again larger than that from the negative binomial (8.3), Poisson (6.0), edgeR (6.65), DESeq2 (8.8), GEMMA (9.8) or Linear (8.05). These results suggest that MACAU can also perform better than existing methods in relatively small sample study designs with unrelated individuals by controlling for hidden confounders. However, MACAU's power gain is much smaller in this setting than in the first two datasets we considered (the baboon and Fusion data). In addition, MACAU's power gain is negligible in the case of n = 6 when compared with the second best method, though its power gain over the commonly used edgeR and DESeq2 is still substantial. MACAU's small power gain in this data presumably stems from both the small sample size and the small effect size of sex in the data, consistent with previous reports for blood cell-derived gene expression ( 3, 7, 120).


Acknowledgements

Special thanks to Leander Dony, who debugged, updated, and tested the case study to work with the latest methods. Furthermore, we would like to thank the many people who proofread the case study notebook and the manuscript and improved it with their comments and expertise. For this, we acknowledge the input of Maren Buttner, David Fischer, Alex Wolf, Lukas Simon, Luis Ospina-Forero, Sophie Tritschler, Niklas Koehler, Goekcen Eraslan, Benjamin Schubert, Meromit Singer, Dana Pe'er, and Rahul Satija. Special thanks for this also to the anonymous reviewers of the manuscript and the editor, Thomas Lemberger, for their thorough, constructive, and extensive comments. The case study notebook was tested and improved by the early adopters Marius Lange, Hananeh Aliee, Subarna Palit and Lisa Thiergart. Volker Bergen and Alex Wolf also contributed to the workflow by making scanpy adaptations. The choice of dataset to optimally show all aspects of the analysis workflow was facilitated by the kind input from Adam Haber and Aviv Regev. This work was supported by the BMBF grant# 01IS18036A and grant# 01IS18053A, by the German Research Foundation (DFG) within the Collaborative Research Centre 1243, Subproject A17, by the Helmholtz Association (Incubator grant sparse2big, grant # ZT-I-0007) and by the Chan Zuckerberg Initiative DAF (advised fund of Silicon Valley Community Foundation, 182835).


8.8 Comparing/Combining scRNASeq datasets

8.8.1 Introduction

As more and more scRNA-seq datasets become available, carrying merged_seurat comparisons between them is key. There are two main approaches to comparing scRNASeq datasets. The first approach is “label-centric” which is focused on trying to identify equivalent cell-types/states across datasets by comparing individual cells or groups of cells. The other approach is “cross-dataset normalization” which attempts to computationally remove experiment-specific technical/biological effects so that data from multiple experiments can be combined and jointly analyzed.

The label-centric approach can be used with dataset with high-confidence cell-annotations, e.g. the Human Cell Atlas (HCA) (Regev et al. 2017) or the Tabula Muris ( . ) once they are completed, to project cells or clusters from a new sample onto this reference to consider tissue composition and/or identify cells with novel/unknown identity. Conceptually, such projections are similar to the popular BLAST method (Altschul et al. 1990) , which makes it possible to quickly find the closest match in a database for a newly identified nucleotide or amino acid sequence. The label-centric approach can also be used to compare datasets of similar biological origin collected by different labs to ensure that the annotation and the analysis is consistent.

Figure 2.4: Label-centric dataset comparison can be used to compare the annotations of two different samples.

Figure 2.5: Label-centric dataset comparison can project cells from a new experiment onto an annotated reference.

The cross-dataset normalization approach can also be used to compare datasets of similar biological origin, unlike the label-centric approach it enables the join analysis of multiple datasets to facilitate the identification of rare cell-types which may to too sparsely sampled in each individual dataset to be reliably detected. However, cross-dataset normalization is not applicable to very large and diverse references since it assumes a significant portion of the biological variablility in each of the datasets overlaps with others.

Figure 2.6: Cross-dataset normalization enables joint-analysis of 2+ scRNASeq datasets.

8.8.2 Datasets

We will running these methods on two human pancreas datasets: (Muraro et al. 2016) and (Segerstolpe et al. 2016) . Since the pancreas has been widely studied, these datasets are well annotated.

This data has already been formatted for scmap. Cell type labels must be stored in the cell_type1 column of the colData slots, and gene ids that are consistent across both datasets must be stored in the feature_symbol column of the rowData slots.

First, lets check our gene-ids match across both datasets:

Here we can see that 96% of the genes present in muraro match genes in segerstople and 72% of genes in segerstolpe are match genes in muraro. This is as expected because the segerstolpe dataset was more deeply sequenced than the muraro dataset. However, it highlights some of the difficulties in comparing scRNASeq datasets.

We can confirm this by checking the overall size of these two datasets.

In addition, we can check the cell-type annotations for each of these dataset using the command below:

Here we can see that even though both datasets considered the same biological tissue the two datasets, they have been annotated with slightly different sets of cell-types. If you are familiar withpancreas biology you might recognize that the pancreatic stellate cells (PSCs) in segerstolpe are a type of mesenchymal stem cell which would fall under the “mesenchymal” type in muraro. However, it isn’t clear whether these two annotations should be considered synonymous or not. We can use label-centric comparison methods to determine if these two cell-type annotations are indeed equivalent.

Alternatively, we might be interested in understanding the function of those cells that were “unclassified endocrine” or were deemed too poor quality (“not applicable”) for the original clustering in each dataset by leveraging in formation across datasets. Either we could attempt to infer which of the existing annotations they most likely belong to using label-centric approaches or we could try to uncover a novel cell-type among them (or a sub-type within the existing annotations) using cross-dataset normalization.

To simplify our demonstration analyses we will remove the small classes of unassigned cells, and the poor quality cells. We will retain the “unclassified endocrine” to see if any of these methods can elucidate what cell-type they belong to.

8.8.3 Projecting cells onto annotated cell-types (scmap)

We recently developed scmap (Kiselev and Hemberg 2017) - a method for projecting cells from a scRNA-seq experiment onto the cell-types identified in other experiments. Additionally, a cloud version of scmap can be run for free, withmerged_seurat restrictions, from http://www.hemberg-lab.cloud/scmap.

8.8.3.1 Feature Selection

Once we have a SingleCellExperiment object we can run scmap . First we have to build the “index” of our reference clusters. Since we want to know whether PSCs and mesenchymal cells are synonymous we will project each dataset to the other so we will build an index for each dataset. This requires first selecting the most informative features for the reference dataset.

Genes highlighted with the red colour will be used in the futher analysis (projection).

From the y-axis of these plots we can see that scmap uses a dropmerged_seurat-based feature selection method.

Now calculate the cell-type index:

We can also visualize the index:

You may want to adjust your features using the setFeatures function if features are too heavily concentrated in only a few cell-types. In this case the dropmerged_seurat-based features look good so we will just them.

Exercise Using the rowData of each dataset how many genes were selected as features in both datasets? What does this tell you abmerged_seurat these datasets?

8.8.3.2 Projecting

scmap computes the distance from each cell to each cell-type in the reference index, then applies an empirically derived threshold to determine which cells are assigned to the closest reference cell-type and which are unassigned. To account for differences in sequencing depth distance is calculated using the spearman correlation and cosine distance and only cells with a consistent assignment with both distances are returned as assigned.

We will project the segerstolpe dataset to muraro dataset:

and muraro onto segerstolpe

Note that in each case we are projecting to a single dataset but that this could be extended to any number of datasets for which we have computed indices.

Now lets compare the original cell-type labels with the projected labels:

Here we can see that cell-types do map to their equivalents in segerstolpe, and importantly we see that all but one of the “mesenchymal” cells were assigned to the “PSC” class.

Again we see cell-types match each other and that all but one of the “PSCs” match the “mesenchymal” cells providing strong evidence that these two annotations should be considered synonymous.

We can also visualize these tables using a Sankey diagram:

Exercise How many of the previously unclassified cells would be be able to assign to cell-types using scmap?

8.8.4 Cell-to-Cell mapping

scmap can also project each cell in one dataset to its approximate closest neighbouring cell in the reference dataset. This uses a highly optimized search algorithm allowing it to be scaled to very large references (in theory 100,000-millions of cells). However, this process is stochastic so we must fix the random seed to ensure we can reproduce our results.

We have already performed feature selection for this dataset so we can go straight to building the index.

In this case the index is a series of clusterings of each cell using different sets of features, parameters k and M are the number of clusters and the number of features used in each of these subclusterings. New cells are assigned to the nearest cluster in each subclustering to generate unique pattern of cluster assignments. We then find the cell in the reference dataset with the same or most similar pattern of cluster assignments.

We can examine the cluster assignment patterns for the reference datasets using:

To project and find the w nearest neighbours we use a similar command as before:

We can again look at the results:

This shows the column number of the 5 nearest neighbours in segerstolpe to each of the cells in muraro. We could then calculate a pseudotime estimate, branch assignment, or other cell-level data by selecting the appropriate data from the colData of the segerstolpe data set. As a demonstration we will find the cell-type of the nearest neighbour of each cell.

8.8.5 Metaneighbour

Metaneighbour is specifically designed to ask whether cell-type labels are consistent across datasets. It comes in two versions. First is a fully supervised method which assumes cell-types are known in all datasets and calculates how “good” those cell-type labels are. (The precise meaning of “good” will be described below). Alternatively, metaneighbour can estimate how similar all cell-types are to each other both within and across datasets. We will only be using the unsupervised version as it has much more general applicability and is easier to interpret the results of.

Metaneighbour compares cell-types across datasets by building a cell-cell spearman correlation network. The method then tries to predict the label of each cell through weighted “votes” of its nearest-neighbours. Then scores the overall similarity between two clusters as the AUROC for assigning cells of typeA to typeB based on these weighted votes. AUROC of 1 would indicate all the cells of typeA were assigned to typeB before any other cells were, and an AUROC of 0.5 is what you would get if cells were being randomly assigned.

Metanighbour is just a couple of R functions not a complete package so we have to load them using source

8.8.5.1 Prepare Data

Metaneighbour requires all datasets to be combined into a single expression matrix prior to running:

Metaneighbor includes a feature selection method to identify highly variable genes.

Since Metaneighbor is much slower than scmap , we will down sample these datasets.

Now we are ready to run Metaneighbor. First we will run the unsupervised version that will let us see which cell-types are most similar across the two datasets.

8.8.6 mnnCorrect

mnnCorrect corrects datasets to facilitate joint analysis. It order to account for differences in composition between two replicates or two different experiments it first matches invidual cells across experiments to find the overlaping biologicial structure. Using that overlap it learns which dimensions of expression correspond to the biological state and which dimensions correspond to batch/experiment effect mnnCorrect assumes these dimensions are orthologal to each other in high dimensional expression space. Finally it removes the batch/experiment effects from the entire expression matrix to return the corrected matrix.

To match individual cells to each other across datasets, mnnCorrect uses the cosine distance to avoid library-size effect then identifies mututal nearest neighbours ( k determines to neighbourhood size) across datasets. Only overlaping biological groups should have mutual nearest neighbours (see panel b below). However, this assumes that k is set to approximately the size of the smallest biological group in the datasets, but a k that is too low will identify too few mutual nearest-neighbour pairs to get a good estimate of the batch effect we want to remove.

Learning the biological/techncial effects is done with either singular value decomposition, similar to RUV we encounters in the batch-correction section, or with principal component analysis with the opitimized irlba package, which should be faster than SVD. The parameter svd.dim specifies how many dimensions should be kept to summarize the biological structure of the data, we will set it to three as we found three major groups using Metaneighbor above. These estimates may be futher adjusted by smoothing ( sigma ) and/or variance adjustment ( var.adj ).

mnnCorrect also assumes you’ve already subset your expression matricies so that they contain identical genes in the same order, fortunately we have already done with for our datasets when we set up our data for Metaneighbor.

Figure 8.20: mnnCorrect batch/dataset effect correction. From Haghverdi et al. 2017

First let’s check that we found a sufficient number of mnn pairs, mnnCorrect returns a list of dataframe with the mnn pairs for each dataset.

The first and second columns contain the cell column IDs and the third column contains a number indicating which dataset/batch the column 2 cell belongs to. In our case, we are only comparing two datasets so all the mnn pairs have been assigned to the second table and the third column contains only ones

mnnCorrect found 2533 sets of mutual nearest-neighbours between n_unique_seger segerstolpe cells and n_unique_muraro muraro cells. This should be a sufficient number of pairs but the low number of unique cells in each dataset suggests we might not have captured the full biological signal in each dataset.

Exercise Which cell-types had mnns across these datasets? Should we increase/decrease k?

Now we could create a combined dataset to jointly analyse these data. However, the corrected data is no longer counts and usually will contain negative expression values thus some analysis tools may no longer be appropriate. For simplicity let’s just plot a joint TSNE.

8.8.7 Cannonical Correlation Analysis (Seurat)

The Seurat package contains another correction method for combining multiple datasets, called CCA. However, unlike mnnCorrect it doesn’t correct the expression matrix itself directly. Instead Seurat finds a lower dimensional subspace for each dataset then corrects these subspaces. Also different from mnnCorrect, Seurat only combines a single pair of datasets at a time.

Seurat uses gene-gene correlations to identify the biological structure in the dataset with a method called canonical correlation analysis (CCA). Seurat learns the shared structure to the gene-gene correlations and then evaluates how well each cell fits this structure. Cells which must better described by a data-specific dimensionality reduction method than by the shared correlation structure are assumed to represent dataset-specific cell-types/states and are discarded before aligning the two datasets. Finally the two datasets are aligned using ‘warping’ algorithms which normalize the low-dimensional representations of each dataset in a way that is robust to differences in population density.

Note because Seurat uses up a lot of library space you will have to restart your R-session to load it, and the plots/merged_seuratput won’t be automatically generated on this page.

First we will reformat our data into Seurat objects:

Next we must normalize, scale and identify highly variable genes for each dataset:

Figure 8.21: muraro variable genes

Figure 8.22: segerstolpe variable genes

Eventhough Seurat corrects for the relationship between dispersion and mean expression, it doesn’t use the corrected value when ranking features. Compare the results of the command below with the results in the plots above:

But we will follow their example and use the top 2000 most dispersed genes withmerged_seurat correcting for mean expression from each dataset anyway.

Exercise Find the features we would use if we selected the top 2000 most dispersed after scaling by mean. (Hint: consider the order function)

Now we will run CCA to find the shared correlation structure for these two datasets:

Note to speed up the calculations we will be using only the top 5 dimensions but ideally you would consider many more and then select the top most informative ones using DimHeatmap .

Figure 8.23: Before Aligning

To identify dataset specific cell-types we compare how well cells are ‘explained’ by CCA vs dataset-specific principal component analysis.

Here we can see that despite both datasets containing endothelial cells, almost all of them have been discarded as “dataset-specific”. Now we can align the datasets:

Figure 8.24: After Aligning

Exercise Compare the results for if you use the features after scaling dispersions.

Figure 8.25: After Aligning

Advanced Exercise Use the clustering methods we previously covered on the combined datasets. Do you identify any novel cell-types?

8.8.8 sessionInfo()


Acknowledgements

We gratefully acknowledge Toshi Kinoshita (Department of Pathology, University of Wisconsin School of Medicine and Public Heath, Madison, WI) for assistance with histology.

Author contributions

N.V.W. conceived the study and obtained funding. N.V.W., S.L.T. and C.K. designed the experiments. N.V.W. and M.Y. conducted the in vivo experiments. C.L. processed all RNA and tissue samples, and performed immunohistochemistry. J.A.D. performed all statistical analyses under the direction of C.K. N.V.W. wrote the manuscript. All authors reviewed and approved the final version.

This work was funded by grants from the National Institute on Deafness and other Communication Disorders [grant numbers R01 DC004428, R01 DC010777] and the Clinical and Translational Science Award (CTSA) program of the National Center for Research Resources [grant number UL1 RR025011].


Watch the video: The BEST Speedcubing Method? CFOPROUXZZ Comparison (December 2021).