Information

27.3: Ancestral Recombination Graphs - Biology


In Figure 28.24 a, the two chromosomes at the top represent the homologous chromosomes of a parent. In reality, recombination happens during meiosis so that a parent will pass on some genetic information from both grandparents, effectively passing on a better representation of the parent genetic information.
At each generation, a recombination event can occur at any loci. The evolutionary history of recombination can be tracked through a sequential graph of trees, such that the ith tree in the graph represents recombination at the ith locus.

Fill in this section based on: www.eecs.berkeley.edu/yss/Pub/SH-JCBO5.pdf and the course notes from 2012. More on this topic could be added in the future

The Sequentially Markov Coalescent

The Sequentially Markov Coalescent Model addresses the role of recombination in tree construction. With recombination involved, a sequence may have two parents, which complicates construction. The Sequentially Markov Coalescent Model tells us that move sequentially from left to right is a simpler and much more efficient approach to analyzing the tree; the approach essentially breaks the tree into local trees and overlays them to describe recombination events. More can be read in the following paper:

Elaborate upon intricacies of the model itself: http://www.ncbi.nlm.nih.gov/pubmed/21270390


27.3: Ancestral Recombination Graphs - Biology

A large offspring number diploid biparental multilocus population model of Moran type is our object of study. At each timestep, a pair of diploid individuals drawn uniformly at random contribute offspring to the population. The number of offspring can be large relative to the total population size. Similar `heavily skewed' reproduction mechanisms have been considered by various authors recently. Each diploid parental individual contributes exactly one chromosome to each diploid offspring, and hence ancestral lineages can only coalesce when in distinct individuals. A separation of timescales phenomenon is thus observed. A result of Möhle (1998) is extended to obtain convergence of the ancestral process to an ancestral recombination graph necessarily admitting simultaneous multiple mergers of ancestral lineages. The usual ancestral recombination graph is obtained as a special case of our model when the parents contribute only one offspring to the population each time. Due to diploidy and large offspring numbers, novel effects appear. For example, the marginal genealogy at each locus admits simultaneous multiple mergers in up to four groups, and different loci remain substantially correlated even as the recombination rate grows large. Thus, genealogies for loci far apart on the same chromosome remain correlated. Correlation in coalescence times for two loci is derived and shown to be a function of the coalescence parameters of our model. Extending the observations by Eldon and Wakeley (2008), predictions of linkage disequilibrium are shown to be functions of the reproduction parameters of our model, in addition to the recombination rate. Correlations in ratios of coalescence times between loci can be high, even when the recombination rate is high and sample size is large.


SIA: Selection Inference Using the Ancestral Recombination Graph

Detecting signals of selection from genomic data is a central problem in population genetics. Coupling the rich information in the ancestral recombination graph (ARG) with a powerful and scalable deep learning framework, we developed a novel method to detect and quantify positive selection: Selection Inference using the Ancestral recombination graph (SIA). Built on a Long Short-Term Memory (LSTM) architecture, a particular type of a Recurrent Neural Network (RNN), SIA can be trained to explicitly infer a full range of selection coefficients, as well as the allele frequency trajectory and time of selection onset. We benchmarked SIA extensively on simulations under a European human demographic model, and found that it performs as well or better as some of the best available methods, including state-of-the-art machine-learning and ARG-based methods. In addition, we used SIA to estimate selection coefficients at several loci associated with human phenotypes of interest. SIA detected novel signals of selection particular to the European (CEU) population at the MC1R and ABCC11 loci. In addition, it recapitulated signals of selection at the LCT locus and several pigmentation-related genes. Finally, we reanalyzed polymorphism data of a collection of recently radiated southern capuchino seedeater taxa in the genus Sporophila to quantify the strength of selection and improved the power of our previous methods to detect partial soft sweeps. Overall, SIA uses deep learning to leverage the ARG and thereby provides new insight into how selective sweeps shape genomic diversity.


Materials and Methods

Coalescent with recombination

The coalescent with recombination models a backwards in time coalescent and recombination process ( Hudson, 1983 ). In this process, three different events are possible: sampling, coalescence and recombination. Sampling events happen at predefined points in time. Recombination events happen at a rate proportional to the number of coexisting lineages at any point in time. Recombination events split the path of a lineage in two, with everything on one side of a recombination breakpoint going going in one ancestry direction and everything on the other side of a breakpoint going in the other direction. As shown in Figure 4 , the two parent lineages after a recombination event each �rry” a subset of the genome. In reality the viruses corresponding to those two lineages still �rry” the full genome, but only a part of it will have sampled descendants. In other words, only a part of the genome carried by a lineage at any time may impact the genome of a future lineage that is sampled. The probability of actually observing a recombination event on lineage l is proportional to how much genetic material that lineage carries. This can be computed as the difference between the last and first nucleotide position that is carried by l, which we denote as L ( l ) . Coalescent events happen between co-existing lineages at a rate proportional to the number of pairs of coexisting lineages at any point in time and inversely proportional to the effective population size. The parent lineage at each coalescent event will �rry” genetic material corresponding to the union of genetic material of the two child lineages.

Events that can occur on a recombination network as considered here. We consider events to occur from present backwards in time to the past (as is the norm when looking at coalescent processes. Lineages can be added upon sampling events, which occur at predefined points in time and are conditioned on. Recombination events split the path of a lineage in two, with everything on one side of a recombination breakpoint going in one and everything on the other side of a breakpoint going in the other direction.

Posterior probability

In order to perform joint Bayesian inference of recombination networks together with the parameters of the associated models, we use a MCMC algorithm to characterize the joint posterior density. The posterior density is denoted as:

Network Likelihood

While the evolutionary history of the entire genome is a network, the evolutionary history of each individual position in the genome can be described as a tree. We can therefore denote the likelihood of observing a sequence alignment (the data denoted D) given a network N and evolutionary model μ as:

with Di denoting the nucleotides at position i in the sequence alignment and Ti denoting the tree at position i. The likelihood at each individual position in the alignment can then be computed using the standard pruning algorithm ( Felsenstein, 1981 ). We implemented the network likelihood calculation P (Di|Ti, μ) such that it allows making use of all the standard site models in BEAST2. Currently, we only consider strict clock models and do not allow for rate variations across different branches of the network. This is because the number of edges in the network changes over the course of the MCMC, making relaxed clock models complex to implement. We implemented the network likelihood such that it can make use of caching of intermediate results and use unique patterns in the multiple sequence alignment, similar to what is done for tree likelihood computations.

Network Prior

The network prior is denoted by P (N|θ, ρ), which is the probability of observing a network and the embedding of segment trees under the coalescent with recombination model, with effective population size θ and per-lineage recombination rate ρ. It essentially plays the same role that tree prior plays in standard phylodynamic analyses.

We can calculate P (N| θ, ρ) by expressing it as the product of exponential waiting times between events (i.e., recombination, coalescent, and sampling events):

Given the coalescent process is a constant size coalescent and given the i-th event is a coalescent event, the event contribution is denoted as:

If the i-th event is a recombination event and assuming constant rates of recombination over time, the event contribution is denoted as:

The interval contribution denotes the probability of not observing any event in a given interval. It can be computed as the product of not observing any coalescent, nor any recombination events in interval i. We can therefore write:

where λ c denotes the rate of coalescence and can be expressed as:

and λ r denotes the rate of observing a recombination event on any co-existing lineage and can be expressed as:

In order to allow for recombination rates to vary across s sections S s of the genome, we modify λ r to differ in each section S s , such that:

with L ( l ) ∩ S s denoting denoting the amount of overlap between L ( l ) and S s . The recombination rate in each section s is denoted as ρs.

MCMC Algorithm for Recombination Networks

In order to explore the posterior space of recombination networks, we implemented a series of MCMC operators. These operators often have analogs in operators used to explore different phylogenetic trees and are similar to the ones used to explore reassortment networks ( Müller et al., 2020 ). Here, we briefly summarize each of these operators.

Add/remove operator.

The add/remove operator adds and removes recombination events. An extension of the subtree prune and regraft move for networks ( Bordewich et al., 2017 ) to jointly operate on segment trees as well. We additionally implemented an adapted version to sample re-attachment under a coalescent distribution to increase acceptance probabilities.

Loci diversion operator.

The loci diversion operator randomly changes the location of recombination breakpoints on a recombination event.

Exchange operator.

The exchange operator changes the attachment of edges in the network while keeping the network length constant.

Subnetwork slide operator.

The subnetwork slide operator changes the height of nodes in the network while allowing to change the topology.

Scale operator.

The scale operator scales the heights of individual nodes or the whole network without changing the network topology.

Gibbs operator.

The Gibbs operator efficiently samples any part of the network that is older than the root of any segment of the alignment and is thus not informed by any genetic data.

Empty loci preoperator.

The empty segment preoperator augments the network with edges that do not carry any loci for the duration of a move, to allow larger jumps in network space.

One of the issues when inferring these recombination networks is that the root height can be substantially larger than when not allowing for recombination events. This can cause computational issue when performing inferences. To circumvent this, we truncate the recombination networks by reducing the recombination rate some time after all positions of the sequence alignment have reached their common ancestor height. We validate the implementation of the coalescent with recombination network prior as well as all operators in the supplement S8. We also show that truncating the recombination networks does not affect the sampling of recombination networks prior to reaching the common common ancestor height of all positions in the sequence alignment.

We then tested whether we are able to infer recombination networks, recombination rates, effective population sizes and evolutionary parameters from simulated data. To do so, we randomly simulated recombination networks under the coalescent with recombination. On top of these, we then simulated multiple sequence alignments. We then re-infer the parameters used to simulate using our MCMC approach. As shown in Figure S9, these parameters are retrieved well from simulated data with little bias and accurate coverage of simulated parameters by credible intervals.

Additionally, we compared the effective sample size values from MCMC runs inferring recombination networks for the MERS spike protein to treating the evolutionary histories as trees. We find that although the effective sample size values are lower when inferring recombination networks, they are not orders of magnitude lower (see fig S10).

Recombination network summary

We implemented an algorithm to summarize distributions of recombination networks similar to the maximum clade credibility framework typically used to summarize trees in BEAST ( Heled and Bouckaert, 2013 ). In short, the algorithm summarizes over individual trees at each position in the alignment. To do so, we first compute how often we encountered the same coalescent event at every position in the alignment during the MCMC. We then choose the network that maximizes the clade support over each position as the maximum clade credibility (MCC) network.

The MCC networks are logged in the extended Newick format ( Cardona et al., 2008 ) and can be visualized in icytree.org ( Vaughan, 2017 ). We here plotted the MCC networks using an adapted version of baltic (https://github.com/evogytis/baltic).

Software

The Recombination package is implemented as an addon to the Bayesian phylogenetics software platform BEAST2 ( Bouckaert et al., 2018 ). All MCMC analyses performed here, were run using adaptive parallel tempering ( Müller and Bouckaert, 2020 ). The source code is available at https://github.com/nicfel/Recombination. We additionally provide a tutorial on how to setup and postprocess an analysis at https://github.com/nicfel/Recombination-Tutorial. The MCC networks are plotted using an adapted version of baltic (https://github.com/evogytis/baltic). All other plots are done in R using ggplot2 ( Wickham, 2016 ). The scripts to setup analyses and to plot the results in this manuscript are available from https://github.com/nicfel/Recombination-Material.

Sequence data

The genetic sequence data for OC43, NL63 and 229e were obtained from ViPR (http://www.viprbrc.org) as described in Kistler and Bedford (2021) . All virus sequences were isolated from a human host. The sequence data for the MERS analyses were the same a described in Dudas et al. (2018) , but using a randomly down sampled dataset of 100 sequences. For the SARS like analyses, we used several different deposited SARS-like genomes, mostly originating from bats, as well as humans and one pangolin derived sequence.

Rates of adaptation

The rates of adaptation were calculated using a modification of the McDonald-Kreitman method, as designed by Bhatt et al. (2011) , and implemented in Kistler and Bedford (2021) . Briefly, for each virus, we aligned the sequence of each gene or genomic region. Then, we the alignment into sliding 3-year slices, each containing a minimum of 3 sequenced isolates. We used the consensus sequence at the first time point as the outgroup. A comparison of the outgroup to the alignment of each subsequent temporal yielded a measure of synonymous and non-synonymous fixations and polymorphisms at each position in the alignment. We used proportional site-counting for these estimations ( Bhatt et al., 2010 ). We assumed that selectively neutral sites are all silent mutations as well as replacement polymorphisms occurring at frequencies between 0.15 and 0.75 ( Bhatt et al., 2011 ). We identified adaptive substitutions as non-synonymous fixations and high-frequency polymorphisms that exceed the neutral expectation. We then estimated the rate of adaptation (per codon per year) using linear regression of the number of adaptive substitutions inferred at each time point. In order to compute the 5’ of spike and 3’ of spike rates of adaptation we used the weighted average of all coding regions to the left (upstream) or right (downstream) of the spike gene, respectively, using the length of the individual sections as weights. We estimated the uncertainty by running the same analysis on 100 bootstrapped outgroups and alignments.


Conclusions

The ARG is indispensable to study evolutionary scenarios where recombination has occurred. With the development of next-generation sequencing (NGS) technologies there is a growing number of genomes at our disposal, many of which could have evolved under recombination (e.g., Utro et al., 2012 Rasmussen and Siepel, 2013). As a consequence, the importance and application of ARGs is expected to increase over the next years.

In order to make progress in the use of ARGs, a key aspect is the design of a standard format to represent the ARG. In this regard, McGill et al. (2013) presented a format based on XML syntax that can easily be used to store and communicate ARGs.


SIA: Selection Inference Using the Ancestral Recombination Graph

Detecting signals of selection from genomic data is a central problem in population genetics. Coupling the rich information in the ancestral recombination graph (ARG) with a powerful and scalable deep learning framework, we developed a novel method to detect and quantify positive selection: Selection Inference using the Ancestral recombination graph (SIA). Built on a Long Short-Term Memory (LSTM) architecture, a particular type of a Recurrent Neural Network (RNN), SIA can be trained to explicitly infer a full range of selection coefficients, as well as the allele frequency trajectory and time of selection onset. We benchmarked SIA extensively on simulations under a European human demographic model, and found that it performs as well or better as some of the best available methods, including state-of-the-art machine-learning and ARG-based methods. In addition, we used SIA to estimate selection coefficients at several loci associated with human phenotypes of interest. SIA detected novel signals of selection particular to the European (CEU) population at the MC1R and ABCC11 loci. In addition, it recapitulated signals of selection at the LCT locus and several pigmentation-related genes. Finally, we reanalyzed polymorphism data of a collection of recently radiated southern capuchino seedeater taxa in the genus Sporophila to quantify the strength of selection and improved the power of our previous methods to detect partial soft sweeps. Overall, SIA uses deep learning to leverage the ARG and thereby provides new insight into how selective sweeps shape genomic diversity.


With probability this operator either deletes a randomly selected conversion or creates a new conversion drawn directly from the prior (A3) where the terms on the right-hand side are those described in the manuscript. The HGF for the deletion form of the proposal is (A4) where is the conversion selected for deletion. The HGF for the addition form is simply

This operator improves mixing by allowing the sampler to transition directly between ARGs that have very similar local tree sets. It does this by proposing the addition or deletion of “detours”: pairs of conversions for which and lie on the same edge of and for which the attachment times satisfy

With probability either the deletion or the addition form of the operator is selected. For addition, a conversion is selected uniformly at random from Two times and are drawn from and labeled so that A nonroot node is then chosen uniformly at random from V. Let be the parent of If or lie on or it is not the case that both then the proposal is immediately rejected. Otherwise, is replaced with a pair of conversions and where l′ and u′ are the points on with times and respectively, and x′, ′y, and b′ are drawn from the affected site region boundary priors and

For deletion, a nonroot node is chosen uniformly at random from V, and is defined as its parent. A pair of conversions, and are chosen uniformly at random satisfying the requirements lies on and lies on This pair is replaced by a single conversion

The HGF for the addition form is (A5) where is the number of conversions r″ in R′ where u″ and l″ lie on distinct CF edges and where u″ lies on Similarly, is the number of conversions with u″ and l″ on distinct edges and where l″ lies on For the deletion form the HGF is (A6)


27.3: Ancestral Recombination Graphs - Biology

The recent explosion of genomic data has underscored the need for interpretable and comprehensive analyses that can capture complex phylogenetic relationships within and across species. Recombination, reassortment and horizontal gene transfer constitute examples of pervasive biological phenomena that cannot be captured by tree-like representations. Starting from hundreds of genomes, we are interested in the reconstruction of potential evolutionary histories leading to the observed data. Ancestral recombination graphs represent potential histories that explicitly accommodate recombination and mutation events across orthologous genomes. However, they are computationally costly to reconstruct, usually being infeasible for more than few tens of genomes. Recently, Topological Data Analysis (TDA) methods have been proposed as robust and scalable methods that can capture the genetic scale and frequency of recombination. We build upon previous TDA developments for detecting and quantifying recombination, and present a novel framework that can be applied to hundreds of genomes and can be interpreted in terms of minimal histories of mutation and recombination events, quantifying the scales and identifying the genomic locations of recombinations. We implement this framework in a software package, called TARGet, and apply it to several examples, including small migration between different populations, human recombination, and horizontal evolution in finches inhabiting the Galápagos Islands.


27.3: Ancestral Recombination Graphs - Biology

The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the "ancestral recombination graph" (ARG), a complete record of coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n-1 chromosomes, an operation we call "threading." Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the true posterior distribution and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age in sites under directional selection. Preliminary results also indicate that our methods can be used to gain insight into complex features of human population structure, even with a noninformative prior distribution.


27.3: Ancestral Recombination Graphs - Biology

Two sequentially Markov coalescent models (SMC and SMC') are available as tractable approximations to the ancestral recombination graph (ARG). We present a Markov process describing coalescence at two fixed points along a pair of sequences evolving under the SMC'. Using our Markov process, we derive a number of new quantities related to the pairwise SMC', thereby analytically quantifying for the first time the similarity between the SMC' and ARG. We use our process to show that the joint distribution of pairwise coalescence times at recombination sites under the SMC' is the same as it is marginally under the ARG, which demonstrates that the SMC' is, in a particular well-defined, intuitive sense, the most appropriate first-order sequentially Markov approximation to the ARG. Finally, we use these results to show that population size estimates under the pairwise SMC are asymptotically biased, while under the pairwise SMC' they are approximately asymptotically unbiased.