The fraction of the genome that is evolutionarily conserved through purifying selection is less than 10%?

I would appreciate help in understanding the meaning, logic, and, in particular, how to interpret the phrase:

The fraction of the genome that is evolutionarily conserved through purifying selection is less than 10%

I could ask numerous question regarding the biology (including what happens to the rest of the genome, and positive selection) and formal logic of this statement, but maybe I could impose, and just leave it to whomever kindly replies.

This remark comes from the abstract of Grauer & al. "On the immortality of television sets: "function" in the human genome according to the evolution-free gospel of ENCODE"

And the Abstract itself:

A recent slew of ENCyclopedia Of DNA Elements (ENCODE) Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates according to which the fraction of the genome that is evolutionarily conserved through purifying selection is less than 10%. Thus, according to the ENCODE Consortium, a biological function can be maintained indefinitely without selection, which implies that at least 80 - 10 = 70% of the genome is perfectly invulnerable to deleterious mutations, either because no mutation can ever occur in these "functional" regions or because no mutation in these regions can ever be deleterious. This absurd conclusion was reached through various means, chiefly by employing the seldom used "causal role" definition of biological function and then applying it inconsistently to different biochemical properties, by committing a logical fallacy known as "affirming the consequent," by failing to appreciate the crucial difference between "junk DNA" and "garbage DNA," by using analytical methods that yield biased errors and inflate estimates of functionality, by favoring statistical sensitivity over specificity, and by emphasizing statistical significance rather than the magnitude of the effect. Here, we detail the many logical and methodological transgressions involved in assigning functionality to almost every nucleotide in the human genome. The ENCODE results were predicted by one of its authors to necessitate the rewriting of textbooks. We agree, many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten.


I will try to tackle the terminology first

  • Conserved: That remains identical (or at least very similar) across species.

  • Purifying selection: Is one of the possible forces that lead to conservation of important sequences of DNA, by removing deleterious alleles.

So, the phrase simply means that the proportion of the genome that is very similar in sequence with the genome of other species (i.e. evolutionarily conserved) is less than 10%.

This is the answer to you question. Then, I just want to add that I share David's view on the abstract.

Improbable Research

The ENCODE project (its stated goal is “ to identify all functional elements in the human genome sequence “) has attracted much criticism. Dan Graur, a member of the Luxuriant Former Hair Club for Scientists (LFHCfS), and his colleagues made hearty contributions to that river of criticism:

“On the immortality of television sets: ‘function’ in the human genome according to the evolution-free gospel of ENCODE,” Dan Graur, Yichen Zheng, Nicholas Price, Ricardo BR Azevedo, Rebecca A. Zufall, and Eran Elhaik, Genome Biology and Evolution, vol. 5, no. 3, 2013, pp. 578-590. The authors write:

A recent slew of ENCyclopedia Of DNA Elements (ENCODE) Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates according to which the fraction of the genome that is evolutionarily conserved through purifying selection is less than 10%. Thus, according to the ENCODE Consortium, a biological function can be maintained indefinitely without selection….

“I would be quite proud to have served on the committee that designed the E. coli genome. There is, however, no way that I would admit to serving on a committee that designed the human genome. Not even a university committee could botch something that badly.” —David Penny (personal communication) …

“The onion test is a simple reality check for anyone who thinks they can assign a function to every nucleotide in the human genome. Whatever your proposed functions are, ask yourself this question: Why does an onion need a genome that is about five times larger than ours?” —T. Ryan Gregory (personal communication) …

The absurd alternative, which unfortunately was adopted by ENCODE, is to assume that no deleterious mutations can ever occur in the regions they have deemed to be functional. Such an assumption is akin to claiming that a television set left on and unattended will still be in working condition after a million years because no natural events, such as rust, erosion, static electricity, and earthquakes can affect it. The convoluted rationale for the decision to discard evolutionary conservation and constraint as the arbiters of functionality put forward by a lead ENCODE author (Stamatoyannopoulos 2012) is groundless and self-serving.

Share this:

Posted by Marc Abrahams on Saturday, December 6th, 2014 at 9:59 am under Arts and Science, Research News.

Access options

Get full journal access for 1 year

All prices are NET prices.
VAT will be added later in the checkout.
Tax calculation will be finalised during checkout.

Get time limited or full article access on ReadCube.

All prices are NET prices.


Abundance and diversity

Most of the retrieved elements are fragmented and truncated, and nested insertions are common particularly among pericentromeric elements belonging to the Athila superfamily, though the core centromere sequences themselves were not available. In fact, the size of the A. thaliana genome has been recently estimated as approximately 157 Mbp (around 20% larger than the estimate published with the genome sequence), and the additional size appears to be due to (unsequenced) heterochromatic repetitive DNA in the centromeres, telomeres and nucleolar-organizing regions [24]. Table 1 shows the relative abundance of each superfamily, and the numbers of complete and solo-LTR elements identified in the genome. Athila is the most abundant superfamily, followed by the Copia-like, Gypsy-like, and TRIM (terminal-repeat retrotransposons in miniature). The ratio of solo-LTRs to complete elements is around 2:1. In addition to solo-LTR formation, deletion and fragmentation of retrotransposon DNA in A. thaliana also occur via other mechanisms: 36% of the DNA in the Athila, 38% in the Gypsy-like, 32% in the Copia-like, and 21% in the TRIM superfamilies correspond to degraded insertions that are neither 'complete' elements nor solo-LTRs.

Age distribution

To obtain the genome-wide age distribution of each superfamily (except TRIM), 564 pairs of intra-element LTRs were (pairwise) aligned and their sequence divergence estimated. Many of the complete TRIM elements have highly divergent LTRs, and I suspect that extensive recombination between inter-element LTRs has occurred. In neighbor-joining trees of LTR sequences (of both complete and solo elements) from the TRIM families Katydid-At1 and Katydid-At2, most intra-element LTR pairs did not cluster. In contrast, when trees were constructed for representatives of the Athila (athila2), Gypsy-like (atlantys2), and Copia-like (meta1, atcopia49, atcopia78) superfamilies, intra-element LTR pairs always clustered (data not shown), providing evidence for the lack of inter-element recombination in those 'families'.

The superfamilies differ significantly in their average age of insertions. Athila insertions are significantly older than the Gypsy-like (Wilcoxon rank-sum test, p < 0.0005), Gypsy-like older than Copia-like (p < 0.0001). Age distributions are summarized in Figure 1.

Age distributions of LTR-retrotransposon superfamilies. Athila insertions are on average significantly older, and Copia-like ones younger, than those from other superfamilies. There are 34 Copia-like, four Athila, and three Gypsy-like insertions with identical intra-element LTRs. The width of the horizontal boxes above the histograms indicates the middle 50% of age values in each superfamily the red band indicates 95% confidence limits on the median, and the green stripe the median value.

Copia-like insertions are younger than host species

Using the rate of 1.5 × 10 -8 substitutions per site per year [25], 97% of 215 complete Copia-like elements are younger than 3 million years (Myr), 90% younger than 2 Myr, and only two insertions estimated to be older than 4 Myr. This shows that complete insertions from the known Copia-like families in the A. thaliana genome are younger than the species itself, whose time of divergence from its closest relatives, such as A. lyrata has been estimated (with the same rate of evolution) to be 5.1-5.4 Myr ago [25]. The situation is less clear for Athila (and the Gypsy-like TEs), as 7% of 219 intra-element LTR pairs were estimated to be older than 5 Myr (3% of the Gypsy-like). Furthermore, the Athila and Gypsy-like superfamilies have an excess of degraded insertions relative to Copia-like (Table 1). Complete elements account for around 50% of the total amount of DNA in Athila and Gypsy-like, indicating that the majority of insertions remaining in the genome have been degraded or have become solo-LTRs. Some of these are likely to be older than the complete insertions. DNA loss (from LTR-RTs) has been shown to occur in A. thaliana [26], and the oldest insertions may have been degraded beyond detection. On the other hand, there is some evidence that synonymous sites in Arabidopsis are not evolving in a completely neutral fashion [27]. If this were the case for the chalcone synthase (Chs) and alcohol dehydrogenase (Adh) loci, their synonymous sites would be evolving more slowly than LTR-RT fossils, and the dating method described above would systematically overestimate the ages of their insertion events.

Athila and Gypsy-like elements were more active in the past

The age distribution of complete Copia-like elements appears to show a recent burst of activity (Figure 1), but I provide evidence (below) that the excess of very young elements is the result of the rapid (relative to Metaviridae insertions) elimination of these elements from the genome. In contrast, the age distributions of complete Athila and Gypsy-like insertions have peaks between 1 and 2 Myr ago (Figure 1). Moreover, whereas there are 34 Copia-like insertions with their intra-element LTRs identical in sequence, only four such Athila and three such Gypsy-like insertions are present. These results indicate that levels of transpositional activity of Athila and Gypsy-like elements have declined since their peak between 1 and 2 Myr ago.

Physical distribution

The chromosomal distribution of retrotransposons (and other TEs) in A. thaliana has been known to be non-random and dominated by a high concentration of elements in the heterochromatic pericentromeric regions [14]. However, this study has revealed significant differences in the chromosomal locations of the LTR-RT superfamilies. I have analyzed the distribution of complete elements and of solo-LTRs in each superfamily along all the chromosome arms combined, relative to the position of the centromeres (that is, the distribution of the distances between each insertion and the centromere, divided by the length of the respective arm), with results summarized in Figure 2.

Differential pericentromeric clustering of complete elements and solo-LTRs along the 10 chromosome arms combined. The vertical axis measures distance from the centromere, divided by the length of the chromosome arm in which a given element is inserted: the value of 0.0 corresponds to the position of the centromeres and 1.0 to telomeres. Box heights indicate the inter-quartile range and widths are proportional to sample size red bands represent 95% confidence limits on the median and the green stripe marks the median value of each sample. Coordinates for the approximate centers of the centromeres on the chromosome sequences were set at 14.70 Mbp for chromosome I (total length 30.14 Mbp), at 3.70 Mbp for II (19.85 Mbp), at 13.70 Mbp for III (23.76 Mbp), at 3.10 Mbp for IV (17.79 Mbp), and at 11.80 Mbp for V (26.99 Mbp).

Athila elements are almost exclusively inserted in the pericentromeric regions, and the other superfamilies in significantly and progressively less proximal regions of the chromosome arms (Wilcoxon rank sum tests: Athila more proximal than the Gypsy-like, p < 0.0001 Gypsy-like more proximal than Copia-like, p < 0.0001 complete Copia-like elements more proximal than complete TRIM elements, p < 0.05 there is no difference between Copia-like and TRIM solo-LTRs). Furthermore, except for TRIM, within each superfamily the solo-LTRs are significantly more distal than the complete elements (Wilcoxon rank sum tests, p < 0.001), suggesting that formation of solo-LTRs is more likely to occur in distal regions. The distribution of complete TRIM elements relative to the centromere is not significantly different from random (goodness-of-fit test, χ 2 = 4.22, df = 3, p > 0.2), although sample size is small, while their solo-LTRs are significantly clustered (goodness-of-fit test, χ 2 = 10.70, df = 3, p < 0.02).

Accumulation in proximal regions by distinct evolutionary mechanisms: purifying selection and insertion bias

The results above indicate that the older a superfamily is, the more its elements are concentrated in the proximal regions. This suggests that insertions into proximal (heterochromatic) regions are more likely to persist for longer periods of time. This interpretation assumes that the neutral mutation rate is the same for both the distal (euchromatic) and proximal (heterochromatic) portions of the genome. Intra-genomic variation in the per-replication mutation rate has been reported between the two sex chromosomes of a flowering plant [28] (although the difference could not be explained their different degree of DNA methylation, a feature often associated with heterochromatin). Given that the dating method used here is based on neutral sequence divergence (between intra-element LTRs), a higher mutation rate in heterochromatin in A. thaliana would affect age comparisons among different groups of elements, as they show different degrees of clustering into the pericentromeric heterochromatin. However, older estimates for the age of heterochromatic elements are consistent with the hypothesis that heterochromatin is a 'safe haven' where TE insertions persist for longer periods of time. Here I show that the mechanisms that led to the accumulation of LTR-RTs in proximal regions are distinct for different groups: elements of the youngest superfamily (Copia-like) insert randomly into the genome (relative to the location of the pericentromeric heterochromatin), but there is negative selection (on the host genome) against their insertions in euchromatin elements of the older superfamilies (Athila, Gypsy-like) preferentially insert into the pericentromeric regions. These distinct mechanisms become apparent when temporal and spatial data are combined (Figure 3), and the chromosomal distribution of young elements compared with the distribution of older elements (within each superfamily).

Relationship between age and physical distributions of complete elements. Insertions into the short arms of chromosomes II and IV were excluded for clarity. These arms contain extensive heterochromatin away from the centromeres, in nucleolar-organizing regions that juxtapose their telomeres, and in a knob [14]. In addition, their short length implies that the pericentromeric heterochromatin, which spans around 1-1.5 Mbp in each arm [68], corresponds to a substantially higher fraction of their total length than in the other eight arms.

For complete Copia-like elements there is a highly significant negative correlation between relative distance from the centromere and age of the insertions (Spearman rank correlation, ρ = -0.39, p < 0.0001). Furthermore, the distribution along the chromosome arms of 34 Copia-like insertions with no divergence between their intra-element LTRs is not significantly different from random (goodness-of-fit test, χ 2 = 3.12, df = 3, p > 0.3). This is evidence that Copia-like elements integrate randomly relative to the location of the centromeres, but tend to get eliminated from distal, and passively accumulate in proximal regions.

The average time to fixation (t) for a neutral allele is given by t = 4N e, where N eis the effective population size. For A. thaliana t can be estimated using an average of estimates of nucleotide diversity (θ) for 8 different A. thaliana genes, θ = 9 × 10 -3 [29], and the synonymous rate of substitution per site per generation, μ = 1.5 × 10 -8 [25]. t = 2θ/μ, yielding an estimate of t ≈ 1.2 Myr. This value for t is consistent with an independent estimate that placed the time since the divergence between A. thaliana and A. lyrata between 3.45t and 5.6t [30]. Given that 75% of all complete Copia-like insertions are younger than 1.2 Myr, most of them are likely to be polymorphic. Taken together with the highly significant negative correlation between age and distance from the pericentromeric regions, these results indicate that complete Copia-like insertions are less likely to get fixed in the distal, euchromatic portions of the chromosome arms than in the pericentromeric heterochromatin.

In contrast, there is no correlation between age and relative distance from centromeres for complete Athila elements (Spearman rank correlation, ρ = 0.01, p = 0.9), as both young and old insertions are found only in proximal regions (Figure 3), compartmentalized into the pericentromeric heterochromatin. This strongly suggests that elements in the superfamily have evolved to preferentially target the pericentromeric heterochromatin, and their genomic distribution, unlike that of Copia-like elements, is not the result of passive accumulation therein. Only if Athila insertions were much more deleterious than Copia-like ones, so that they would be very rapidly removed by purifying selection, could passive accumulation be the case.

Gypsy-like insertions display a similar pattern to Athila. Even though there is for complete elements a significant, negative correlation between relative distance from centromeres and age, this is due to an excess of recent insertions near the telomere of the short arm of chromosome II (data not shown). If the arm is excluded from the analysis there is no significant correlation (Spearman rank correlation, ρ = -0.09, p > 0.3). This suggests that for the Gypsy-like also there is an insertional bias towards proximal regions. This bias is not as strong as for Athila, as complete Gypsy-like insertions are not exclusively found around the centromeres, and they cluster (to a much lesser extent) in at least one other heterochromatic region (the telomere of the short arm of chromosome II). Included in the Gypsy-like 'superfamily' is a clade of elements, known as Tat, which is a sister group to Athila to the exclusion of the remaining Gypsy-like elements [31]. The age and physical distribution of Tat does not differ from those of the remaining Gypsy-like elements (Wilcoxon rank-sum tests, p > 0.4) Tat show insertion bias towards the pericentromeric regions, but again to a lesser degree than Athila.

Half-life of complete Copia-like insertions

Given that Copia-like elements have been active until recently but tend to be eliminated by purifying selection, their age distribution (Figure 1, bottom) reflects the process of origin and loss of complete elements, when averaged over evolutionary time scales (and over all Pseudoviridae lineages). If this is assumed to be a steady-state process, it can be modeled by the survivorship function: N(K) = N oe -aK , where N(K) is the number of elements observed with intra-element LTR divergence K, and N oand a are constants to be fitted. The rate of elimination can then be estimated by linear regression of the log-transformed data (the half-life of insertions is given by ln2/a). Figure 4 shows the fit for all complete Copia-like insertions (R 2 = 0.94), and for complete insertions outside the proximal regions (i.e. with relative distance from centromeres >0.2 R 2 = 0.95). Complete Copia-like elements are eliminated from the genome with a half-life of 648,000 years (SE = 48,000 years). Insertions exclusively outside the proximal (heterochromatic) regions are lost more rapidly, with a half-life of 472,000 years (SE = 46,000 years).

Loss of complete Copia-like elements. The half-life of complete Copia-like elements throughout the whole genome (log-transformed counts marked by blue circles, blue regression line) is estimated as around 650,000 ± 50,000 years. Complete insertions outside the proximal regions (red squares, red regression line) are lost more rapidly, with a half-life estimated as around 470,000 ± 50,000 years.


Genome-wide polymorphism patterns in crossover regions

Table 1 presents summaries of polymorphism patterns for autosomal (A) and X-linked (X) loci situated in genomic regions where crossing-over occurs (C regions). For ease of presentation, we will refer to nucleotide diversity, π, calculated using 0-fold sites, 4-fold sites and SI sites (positions 8� from the 5′ end of short introns �𠂛p) as π0, π4 and πSI, respectively a similar notational convention will be used for other statistics. For both A and X, and in both the Rwandan (RG) and French (FR) samples, π0, Tajima's D0 (Tajima 1989), and MAF0 are significantly smaller than the corresponding estimates obtained from 4-fold and SI sites (Ppermutationπ.001 in all cases), consistent with the well-known fact that most nonsynonymous mutations are deleterious (Pal et al., 2006 Eyre-Walker and Keightley, 2007), and are therefore kept at low frequencies in the population by purifying selection (Kimura, 1983). Previous studies have suggested that SI sites may be neutrally evolving (Halligan and Keightley, 2006 Parsch et al., 2010). In our data set, πSI seems to be somewhat smaller than π4, which may be due to the stringent data filtering procedure we employed (see Materials and Methods), or the higher GC content at 4-fold sites compared to intronic sites, which in turn is expected to result in an increased mutation rate in 4-fold sites (Singh et al., 2005 Keightley et al., 2009). There is, however, no statistically discernible difference with respect to either MAF or Tajima's D between 4-fold and SI sites ( Table 1 PpermutationϠ.1 for both A and X).

Table 1

ChrSiteWithin population a Between populations b
  Pop. c πTajima's DMAFF UF W
 SI d RG0.0145𢄠.13800.16300.16770.1766
 SI d RG0.0160𢄠.45610.13790.20330.3173

Abbreviations: MAF, minor allele frequency

The FR sample has a lower level of diversity than RG for all three types of sites ( Table 1 ), reflecting a loss of genetic variation induced by population bottlenecks which are believed to have occurred as the species migrated out of Africa (Haddrill et al., 2005b Li and Stephan, 2006 Thornton and Andolfatto, 2006 Hutter et al., 2007 Duchen et al., 2013). The difference in π0 between the two populations is somewhat smaller than those observed for π4 and πSI (for example, on A, π0(FR)/π0(RG)=0.83 versus π4(FR)/π4(RG)=0.77). This is probably because more 0-fold sites are under strong selective constraint, so that variants at these sites behave almost deterministically, and are therefore less sensitive to demographic changes (for example, Zeng, 2013).

To inspect overall patterns of genetic differentiation between the RG and FR populations, we calculated FST (abbreviated here as F see Equation (1) in Materials and Methods), as defined by Weir and Cockerham (1984), using the estimator of Hudson et al. (1992). Two approaches were employed to combine information over multiple SNPs: un-weighted mean F (Equation (5)) and weighted mean F (Equation (6)), which will be referred to as F U and F W , respectively. Because most nonsynonymous mutations are likely to be deleterious, it is expected that levels of population differentiation at these selectively constrained sites should be lower than those at less constrained sites (for example, 4-fold sites) (Barreiro et al., 2008 Maruki et al., 2012). Surprisingly, values of , estimated using either the autosomal or X-linked data, are not statistically different from those of either or ( Table 1 PpermutationϠ.1 in all cases). There is also no detectable difference between and (PpermutationϠ.1 for both A and X). In contrast, was found to be significantly smaller than both and (Ppermutationπ.001 for both A and X), whereas the differences between and remain non-significant (PpermutationϠ.1 for both A and X). The patterns obtained from F U are therefore more compatible with the a priori expectation that 0-fold sites are on average more constrained than 4-fold and SI sites. We will investigate causes for the lack of difference between and either or in a later section.

Several differences between A and X are of note ( Table 1 ). Firstly, consistent with previous reports (Caracristi and Schlotterer, 2003 Hutter et al., 2007 Charlesworth, 2012b Pool et al., 2012 Campos et al., 2013), the X:A ratio in diversity at putatively neutral sites (that is, 4-fold and SI sites) is about 1 in the RG population (π4(X)/π4(A)=1.08 and πSI(X)/πSI(A)=1.10), higher than the null expectation of 3/4. Secondly, the reduction in diversity in FR is more pronounced for X than A for all three types of sites (for example, π4(FR)/π4(RG)=0.41 and 0.77 for X and A, respectively), as reported in previous investigations (Caracristi and Schlotterer, 2003 Hutter et al., 2007). Finally, the extent of population differentiation at both 4-fold and SI sites, as measured by either F U or F W , is significantly higher on the X than on A (Ppermutationπ.001 for all comparisons). This is probably largely driven by the greater reduction in diversity on the X in non-African populations, as values of Dxy, the mean number of nucleotide substitutions between sequences taken from different subpopulations (Nei and Miller, 1990), are comparable between A and X in this study: Dxy,4=1.65 and 1.64%, and Dxy,SI=1.51 and 1.58%. A systematic examination of possible causes of the apparent differences between A and X is beyond the scope of this study the interested reader can refer to previous studies of this topic (Charlesworth, 2001 Pool and Nielsen, 2007 Singh et al., 2007 Pool and Nielsen, 2008 Yukilevich et al., 2010 Charlesworth, 2012b Campos et al., 2013). In what follows, results obtained from A and X will be presented separately.

Limited evidence for selection on codon usage bias affecting patterns of population differentiation at 4-fold degenerate sites

To investigate whether selection on codon usage bias affects differentiation patterns at 4-fold sites, we first examined the relationship between and Fop, as the latter is well known to be correlated with the intensity of selection on codon usage bias (reviewed in Hershberg and Petrov, 2008 Zeng and Charlesworth, 2009). Considering the large variance of the F estimators and the dearth of SNPs in individual genes, we grouped the genes into equal-sized bins with similar numbers of SNPs at 4-fold sites. As shown in Supplementary Figure S2A, Fop and are not correlated on A (Kendall's τ=𢄠.01, PϠ.1). On the X, some evidence for a weak negative correlation was obtained (Supplementary Figure S2B), but it is not statistically significant (Kendall's τ=𢄠.6, P=0.13). When was considered, no correlation was found on either A or X (Supplementary Figures S2E and F). To investigate this further, for the genes within each bin on the X, we tested whether differed from statistically. Amongst the six bins, no evidence of a significant difference was found for the first four bins, whereas the differences were marginally significant for the last two bins with highest Fop (Ppermutation=0.04 and 0.05, respectively). Similarly, we did not detect any correlation between KS and either or (Supplementary Figure S2).

Overall, there is limited evidence that selection on codon usage bias is strong enough to substantially alter patterns of genetic differentiation at 4-fold sites. Considering that 4-fold and SI sites in C regions are comparable with respect to both MAF and F, in what follows, we will use population differentiation patterns obtained from the two types of site as neutral standards, and will refer to them as putatively neutral sites.

Evolutionarily conserved genes are under stronger purifying selection and have reduced F at 0-fold degenerate sites

Genes in C regions were divided into equal-sized bins (with similar numbers of SNPs) based on their KA values between D. melanogaster and D. yakuba. We inspected polymorphism patterns in the RG sample as a function of KA a qualitatively identical set of results were obtained using the FR sample (Supplementary Figure S3). On both A and X, KA was found to be significantly positively correlated with both π0 ( Figures 1a and b A: Kendall's τ=0.989 and Pπ.001 X: Kendall's τ=1 and P=0.009) and Tajima's D0 ( Figures 1c and d A: Kendall's τ=0.884, Pπ.001 X: Kendall's τ=0.867 and P=0.024). No statistically significant relationship was found when comparing KA with Tajima's D4 ( Figures 1c and d Kendall's τ=𢄠.2 and 𢄠.333, PϠ.1, for X and A), although there is a negative correlation between KA and π4 on A ( Figure 1a Kendall's τ=𢄠.6, Pπ.001) (see also Andolfatto, 2007 Haddrill et al., 2011). In particular, on both A and X, π0 and Tajima's D0 approach π4 and Tajima's D4, respectively, as KA increases. In contrast, values of π4 and Tajima's D4, regardless of the KA bin from which they were obtained, remain similar to the values of πSI and Tajima's DSI. These results suggest that 0-fold sites are under stronger constraints than 4-fold and SI sites, and that 0-fold sites in genes with smaller KA are, on average, under stronger purifying selection. We obtained the same results when we used the D. simulans genome as an out-group (Supplementary Figure S4).

Polymorphism patterns within 17 Rwandan D. melanogaster lines for coding sequence (CDS) binned by KA value (to D. yakuba), and for sites in the 8�𠂛p regions of short introns �𠂛p (SI sites). (a) Nucleotide diversity (π) for autosomal CDS-C and (b) X-linked CDS-C regions (c) Tajima's D for autosomal CDS-C regions and (d) X-linked CDS-C regions. The x axes show the maximum KA value in each bin. Symbols: 0-fold degenerate sites—open circles 4-fold degenerate sites—open triangles SI sites—open red squares.

Figures 2a and b show that evolutionarily conserved genes have significantly smaller (A: Kendall's τ=0.663, Pπ.001 X: Kendall's τ=0.867, P=0.02). Again, we obtained the same result when using D. simulans as the out-group (Supplementary Figure S5). The pattern remains statistically significant for autosomes when was considered (Supplementary Figure S6). The reduction in F0 for genes with smaller KA is associated with a strong reduction in MAF0 ( Figures 2c and d ) and an increase in the proportion of 0-fold SNPs that are private to one of the two populations ( Figures 2e and f ), both of which are hallmarks of selection against deleterious mutations (cf., recent findings in humans Nelson et al., 2012 Fu et al., 2013), and are expected to drive both F U and F W downwards, as shown in Materials and Methods (see also Maruki et al., 2012 Bhatia et al., 2013 Jakobsson et al., 2013). For the 4-fold sites on both A and X, no correlation with KA was observed for F U , F W , MAF and the proportion of private SNPs ( Figure 2 PϠ.1 in all cases based on Kendall's τ).

Differentiation patterns between 7 French and 17 Rwandan D. melanogaster lines for coding sequence (CDS) binned by KA value (to D. yakuba), and for SI sites. (a) Unweighted mean FST (F U Equation (5)) for autosomal coding CDS-C and (b) X-linked CDS-C regions (c) population-average MAF for autosomal CDS-C regions and (d) X-linked CDS-C regions (e) the proportion of SNPs per bin in which one allele was private to one of the D. melanogaster populations for autosomal CDS-C regions and (f) X-linked CDS-C regions. Symbols: 0-fold degenerate sites—open circles 4-fold degenerate sites—open triangles SI sites—open red squares.

The data presented in Figures 1 and ​ and2 2 suggest that the lack of difference between and either or reported in the previous section is probably because of the fact that F W gives more weight to SNPs with higher expected levels of polymorphism (for example, nearly neutral variants), as we have shown in Materials and Methods. In other words, when all 0-fold sites in C regions were analysed together ( Table 1 ), the effects of purifying selection on a substantial fraction of 0-fold sites were probably masked by those 0-fold sites that are nearly neutrally evolving. Consequently, the overall distribution of appears non-distinguishable from those of and . In contrast, F U gives equal weight to all SNPs. Considering that the value of F when calculated using a single SNP is constrained by MAF (see Equation (3) in Materials and Methods), F U is expected to be more sensitive to the action of purifying selection than F W , consistent with the observation reported above. In the Discussion, we will further explore the implications of these statistical properties of F, which arise when information from multiple SNPs is combined.

Longer introns are under stronger selective constraints and are less differentiated

In agreement with earlier findings (Haddrill et al., 2005a Halligan and Keightley, 2006), longer introns tend to have lower divergence (K) between D. melanogaster and D. simulans (A: Kendall's τ=𢄠.635, Pπ.001 X: Kendall's τ=𢄠.486, Pπ.001 Figures 3a and b ), probably as a result of the presence of functional elements that are subject to purifying selection (Bergman and Kreitman, 2001 Parsch, 2003 Andolfatto, 2005 Haddrill et al., 2005a Halligan and Keightley, 2006 Casillas et al., 2007 Roy et al., 2010). Here, we report further support for this hypothesis by examining within-population polymorphism patterns as a function of intron length. Consistent with the action of purifying selection, longer introns have lower π ( Figures 3c and d ) and more negative Tajima's D ( Figures 3e and f ) compared with 4-fold and SI sites (similar results were observed in the FR sample see Supplementary Figure S7). Interestingly, the patterns of divergence and polymorphism level off for introns longer than 2000𠂛p. Using the RG sample, the values of π and Tajima's D obtained from introns longer than 2000𠂛p are 0.0072 and 𢄠.5476 for A, and 0.0076 and 𢄠.9013 for X, respectively all these values are substantially lower than the corresponding values observed at 4-fold and SI sites, but are higher than those obtained from 0-fold sites (see Table 1 ). Furthermore, the KA values for CDS in C regions between D. melanogaster and D. simulans are 0.015 and 0.018 for A and X, respectively, which are significantly smaller than the values of K for long introns �𠂛p on A and X, which are 0.061 and 0.074, respectively (Mann–Whitney U test, Pπ.001). These results imply that long introns, especially those �𠂛p, are more constrained than the 4-fold and SI sites, but probably contain fewer strongly selected sites than 0-fold sites.

Divergence and polymorphism patterns for intronic sites binned by intron length. (a) Divergence (K) between D. melanogaster and D. simulans for autosomal introns and (b) X-linked introns (c) nucleotide diversity (π) for autosomal introns and (d) X-linked introns (e) Tajima's D for autosomal introns and (f) X-linked introns. The x axes display the maximum intron length in each bin. Note that the number of SNPs in each autosomal intron bin is roughly the same as that in the autosomal SI bin the same applies to the X-linked data. Symbols: Long intronic sites—open circles positions 8�𠂛p sites of short introns �𠂛p (SI sites)—open red squares.

Estimates of F W , when calculated using sites from introns more than 65𠂛p in length, were 0.171 and 0.283 for A and for X, respectively. None of these was found to be statistically different from the corresponding values estimating using 4-fold and SI sites reported in Table 1 (PpermutationϠ.1 in all cases). F U for introns 㹥𠂛p were 0.157 and 0.174 for A and X, respectively, both of which were significantly smaller than both and (Ppermutationπ.001 in all cases). There is a clear negative relationship between F U and intron length ( Figures 4a and b for A and X, Kendall's τ=𢄠.356 and 𢄠.364 P=0.010 and Pπ.001, respectively), which mirrors that between MAF (or the prevalence of private SNPs) and intron length (Supplementary Figure S8), and is consistent with the expected effect of purifying selection on genetic differentiation between populations. The relationship between differentiation and intron length is weaker when F W was analysed (Supplementary Figure S8 for A and X, Kendall's τ=𢄠.271 and 𢄠.146, and P=0.05 and 0.16, respectively). These differences between F W and F U can be explained by the fact that fewer sites in introns 㹥𠂛p are expected to be strongly selected compared with 0-fold sites. As discussed in the previous section, F W , which tends to reflect differentiation patterns at neutral sites in the data, is less likely to recover signatures of purifying selection compared to F U .

Differentiation between 7 French and 17 Rwandan D. melanogaster lines for long intronic sites binned by intron length, and for SI sites. (a) Unweighted mean FST (F U Equation (5)) for autosomal introns and (b) X-linked introns. Symbols: Long intronic sites—open circles SI sites—open red squares.

Regions with reduced recombination tend to have higher F

It is known that genomic regions that lack crossing over (NC regions) have very different patterns of divergence and polymorphism than those seen in C regions (Haddrill et al., 2007 Betancourt et al., 2009 Arguello et al., 2010 Campos et al., 2012 Campos et al., 2014). In Table 2 , we present summary statistics of the NC data pertinent to the current study (see Materials and Methods for a list of the NC regions considered). It can be seen that, for both 0-fold and 4-fold sites, values of F in NC regions are generally higher than those obtained using the same type of site in C regions, regardless of the way in which information from multiple SNPs was combined. Specifically, the average KA to D. yakuba is about 0.05 for the NC loci (Campos et al., 2012). calculated using autosomal and X-linked NC data are 0.1817 and 0.3012, respectively ( Table 2 ), higher than the values of 0.1569 and 0.1685 for autosomal and X-linked genes in C regions spanning the same KA values ( Figures 2a and b Ppermutation=0.05 for A and Ppermutationπ.001 for X).

Table 2

ChrSiteWithin populationBetween populations
  Pop.πTajima's DMAFF UF W

Abbreviations: FR, French MAF, minor allele frequency RG, Rwandan.

The statistics were obtained in the same way as in Table 1 see Materials and Methods for more details.

It should be noted that the elevation in F in NC regions is probably caused by an extreme reduction in within-population diversity induced by tight linkage between a large number of selected sites ( Table 2 Kaiser and Charlesworth, 2009 O'Fallon et al., 2010 Seger et al., 2010 Zeng and Charlesworth, 2010). This is because F is a relative measure of differentiation (see Equation (1)), and therefore all else being equal, F is expected to be elevated by forces that reduce within-population diversity (that is, πS in Equation (1)), irrespective of whether diversifying selection or reduced gene flow has affected the genomic region under study (Charlesworth, 1998 Noor and Bennett, 2009).

To further examine the effects of selection at linked sites, we inspect the correlation between F at putatively neutral sites and local recombination rates in C regions. Figure 5 presents results based on autosomal loci, where it can be seen that is reduced with more frequent recombination (Kendall's τ=𢄠.474, P=0.004 the data point obtained from the NC regions was not included in the calculation). However, there is no statistically significant relationship between recombination rate and ( Figure 5b Kendall's τ=𢄠.179 and P=0.28). Weak negative correlations were also found on the X chromosome for 4-fold and SI sites (Supplementary Figure S9). The patterns remained unchanged when F W was used (Supplementary Figure S10).

Differentiation between 7 French and 17 Rwandan D. melanogaster lines for 4-fold degenerate sites and SI sites in C regions as a function of local recombination rate, and for 4-fold degenerate sites in NC regions. (a) F U for autosomal CDS regions and (b) autosomal SI regions.

ChIP-seq experiments and data analysis

To characterize the CTCF binding profile in Mus musculus castaneus (CAST/EiJ) and M. spretus (SPRET/EiJ), we performed chromatin immunoprecipitation experiments followed by high-throughput sequencing (ChIP-seq) using adult liver tissue. ChIP-seq libraries and input control libraries from three biological replicates of each species were prepared as described in [62]. Subsequently, libraries were sequenced on a HiSeq2000 (Illumina) to produce 100-bp paired-end sequence fragments.

In addition, we obtained published CTCF ChIP-seq data from the livers of Mus musculus domesticus (C57BL/6J), Mus caroli/EiJ, and M. pahari/EiJ [35]. Three biological replicates from each species were used.

We aligned sequenced reads from CAST and M. spretus to the reference genome assemblies CAST_EiJ_v1 and SPRET_EiJ_v1 [63], respectively, with BWA mem version 0.7.12 [64] discarding reads with more than three occurrences. We also mapped the retrieved raw ChIP-seq reads from C57BL/6J, M. caroli, and M. pahari to the genomes GRCm38 (mm10), CAROLI_EIJ_v1.1, and PAHARI_EIJ_v1.1 [63, 65], respectively, using the same method for the sake of performing matched analyses in all species. CTCF enrichment peaks were called with MACS 1.4.2 [66] with a p value threshold of 0.001. For downstream analyses, we used peaks identified in at least two replicates of each species (Additional file 1: Table S1). To produce binding heatmaps (Additional file 1: Figure S1B), we used deeptools version 3.3.1 [67]. We first subtracted the appropriate input library from each ChIP sequencing library using the bamCompare tool. Then, for each species, we produced heatmaps corresponding to the number of ChIP reads—input reads within all peaks found in at least two replicates using the computeMatrix and plotHeatmap tools.

We also performed ChIP-seq in C57BL/6J liver to identify genomic regions enriched for the cohesin subunit RAD21, using also an input control library from C57BL/6J liver from Thybert et al. [35]. Sample preparation and chromatin immunoprecipitation was performed as described in Schmidt et al. [34] using 10 μg RAD21 antibody (Abcam, ab992, lot GR12688-8). Immunoprecipitated DNA and 50 ng of input DNA were used for library preparation using the ThruPLEX DNA-Seq library preparation protocol (Rubicon Genomics, UK). Library fragment size was determined using a 2100 Bioanalyzer (Agilent). Libraries were quantified by qPCR (Kapa Biosystems). Pooled libraries were deeply sequenced on a HiSeq2500 (Illumina) according to the manufacturer’s instructions to produce single-end 50-bp reads. We obtained sequenced reads and mapped them to the mouse genome assembly GRCm38 using BWA 0.6.1 [64]. We then called RAD21 peaks using MACS2 2.1 with default options [66].

We used the boundaries of mouse liver TADs published by Vietri Rudan et al. [15]. We considered TAD boundaries as the start and end nucleotides of each TAD, while in some of the analyses (where indicated in the following method description), we used a window of ± 50 kb around them to study TAD boundary regions.

Conservation of CTCF binding sites in Mus species

To investigate the conservation of CTCF binding across the studied Mus species, we first found the orthologous alignments of the CTCF ChIP-seq peaks in the genomes of the other species. These orthologous CTCF regions across mice were obtained using an extended version of the eutherian mammal Endo-Pecan-Ortheus (EPO) multiple genome alignment that also included the genomes of CAST, M. spretus, M. caroli, and M. pahari [35]. Once the orthologous regions of CTCF sites were identified in all Mus species, we cross-validated the binding of CTCF in each species using the corresponding ChIP-seq data. Specifically, we considered that a CTCF site was conserved if it (a) had an orthologous alignment across species and (b) the orthologous alignments also contained a CTCF ChIP-seq peak (Fig. 1c).

Binding affinity and sequence constraint of CTCF motifs

To identify CTCF binding motifs, we retrieved the FASTA sequences of all CTCF peaks in C57BL/6J, using bedtools getfasta v.2.25.0 [68], and scanned these sequences for the primary CTCF binding motif (M1) from the JASPAR database [69] using Find Individual Motif Occurrences (FIMO) from the MEME suite v.4.12.0 [70, 71] with default parameters. We extended the identified 19 base-long M1 motifs to include 20 bases upstream and 20 bases downstream in order to allow the discovery of the extended version of the motifs (M1 and M2). Finally, we calculated the binding affinity of these sequences for CTCF using DeepBind v.0.11 [72], as in Aitken et al. [55], and compared the significance of the difference between distributions of the affinity values between motifs found in TAD boundary-associated and non-TAD boundary-associated CTCF peaks at each conservation level (Fig. 2a, b).

To retrieve rejected substitution (RS) scores for each position of every identified 19 base-long M1 motif in C57BL/6J, we obtained pre-calculated GERP [42] conservation scores for each nucleotide of these mouse M1 sequences from Ensembl [73]. The RS score of a genomic position was calculated as the difference of observed to expected substitutions. We then averaged the RS score per position among all motifs and compared these averaged RS scores of TAD boundary-associated M1 motifs with non-TAD boundary-associated motifs (Fig. 2e, f).

ChIP-seq enrichment and read coverage of identified CTCF peaks

The CTCF sites that we identified in each species were the intersection of the CTCF peaks called in ≥ 2 biological replicates. We calculated the ChIP-seq fragment enrichment of each CTCF site by averaging the ChIP enrichment scores, reported by MACS, over the replicates. We then compared the significance of the difference between the distributions of average ChIP enrichment between TAD boundary-associated and non-TAD boundary-associated CTCF sites of each conservation level using Mann-Whitney U tests (Fig. 2c, d).

We used bedtools multicov v.2.25.0 to calculate the counts of read alignments at TAD boundary-associated versus non-TAD boundary-associated CTCF peak regions, in a total of five C57BL/6J replicates (Additional file 1: Figure S6). To increase the robustness of our observations, we added two additional replicates to the three initial ones, which we processed in the same way as the other replicates (see the “ChIP-seq experiments and data analysis” section).

Motif word usage analysis

We scanned all CTCF peaks from each of the five species for the primary CTCF binding motif (M1) using FIMO from the MEME suite as described above. From the 19 base M1 motif instances identified in each species, we retrieved the central most informative 14-mer and estimated its frequency of occurrence as the number of occurrences of the 14-mer word in CTCF binding regions divided by the number of occurrences of the word in the whole genome of the species using the procedure of Schmidt et al. [34]. We filtered out any motif word that occurred fewer than five times in the whole genome. We illustrated the occurrence frequency of the motif words in each species on a heatmap which is sorted by distance to the closest TAD border (Additional file 1: Figure S7).

Association of CTCF binding sites with classes of transposable elements

We used the full set of CTCF sites identified in all species and projected them on to the C57BL/6J genome (GRCm38), as well as published transposable elements in C57BL/6J (Thybert et al. [35] We intersected the center of each CTCF binding site with the transposable elements and reported the number of CTCF site centers that overlapped with each TE class. The overall representation of each TE class in the whole genome that is shown as a reference (marked as “background” in Fig. 3a) was calculated as the total length of all TEs belonging to each class (SINE, LINE, LTR, DNA) sequences divided by the total genome length.

Representation of TE classes at TAD boundary regions

As for Fig. 3b, we defined TAD boundary regions as genomic windows of 50 kb upstream and 50 kb downstream of the boundaries of TADs. To evaluate the representation of each TE class, we summed the length of sequences corresponding to each TE class that occurred within each TAD boundary region and divided that by the total length of the TAD boundary region, i.e., 100 kb. To retrieve random genomic regions of similar length and distribution, we shuffled the TAD boundary regions using bedtools shuffle v2.2.5.0, having first excluded chromosome Y, genome scaffolds, and chromosome ends, where TADs are not called. We repeated the same calculation for TE class representation as above for these shuffled TAD boundaries, i.e., random genomic regions. We then plotted the distribution of these values for TAD boundary regions and random genomic regions. To determine the representation of each TE class in the background genome (dotted line in Fig. 3b), we divided again the total length of all sequences that correspond to each TE class by the total C57BL/6J genome (GRCm38) length, analogous to the CTCF TE class analysis above.

Density of CTCF sites at TAD boundaries and clusters of CTCF binding sites

To determine the enrichment of CTCF binding sites in TAD boundary regions (compared to the surrounding genome), we measured the distance of each CTCF binding site to its closest TAD boundary using bedtools closest. We then categorized the CTCF sites based on their conservation level. For each CTCF site conservation level, we grouped all distance values up to ± 300 kb in bins of 20 kb and plotted the number of CTCF sites in each bin divided by the length of the bin, i.e., 20 kb (Fig. 4a). To further characterize the density of CTCF sites at TAD boundaries, we grouped CTCF sites both according to their conservation level and association with a TAD boundary (versus no association with any TAD boundary), and for each of these categories, we found the distance of each CTCF site from its closest CTCF site using bedtools closest (Fig. 4b).

To identify clusters of CTCF binding sites, we used the full set of CTCF binding sites of all five Mus species projected onto the C57BL/6J genome (GRCm38/mm10), as shown in Fig. 1c. We identified instances of consecutive CTCF sites that were up to 10 kb apart from each other, using bedtools cluster. We then determined and compared the enrichment of clustered and singleton CTCF sites at TAD boundaries using the same approach as in Fig. 4a but having categorized the CTCF sites based on whether they belong to a cluster (clustered) or not (singletons) (Fig. 4c).

For Fig. 4d, e, we again defined TAD boundary regions as TAD boundary ± 50 kb. We categorized these regions based on the highest conservation level of their CTCF sites. Subsequently, for each category, we counted its total number of CTCF sites (Fig. 4d), as well as the number of these TAD boundary regions with clustered CTCF sites and with only singleton sites (Fig. 4e).

For Additional file 1: Figure S8, we defined Mus-conserved (5-way) CTCF sites with a distance to the closest TAD border > 80 kb as non-TAD boundary associated. We calculated the enrichment of 1-way (species-specific), 2-way, 3-way, and 4-way conserved CTCF sites in their vicinity in the same way as in for TAD boundaries (Fig. 4a), but using as anchor the non-TAD boundary-associated 5-way CTCF sites themselves, instead of the TAD boundaries.

Clusters in C57BL/6J and cluster conservation analyses

We identified clusters of CTCF binding sites in C57BL/6J (Additional file 1: Figure S9) in the same way as for Fig. 4c but using only CTCF peaks called in C57BL/6J. We used the same methods as for Fig. 4a, c to determine the enrichment of CTCF sites of different conservation levels at TAD borders (Additional file 1: Figure S9A), as well as the enrichment of clustered versus singleton CTCF sites (Additional file 1: Figure S9B).

To estimate the conservation of CTCF sites clusters (Additional file 1: Figure S9D), we identified all the genomic regions that correspond to clusters of CTCF sites in each of the five species separately. We then projected through whole-genome alignments (see the “Conservation of CTCF binding sites in Mus species” section) the cluster regions of each species onto the C57BL/6J genome and determined whether they overlap with the orthologous cluster regions of the other species.

RNA-seq data

We retrieved published liver-derived RNA-seq data from six biological replicates for each of the species C57BL/6J and M. m. castaneus [74], as well as from four biological replicates of M. caroli [75]. To have the same number of replicates in each species, we further generated and sequenced two additional RNA-seq libraries for M. caroli following the methods described in Goncalves et al. [74] and Wong et al. [75]. Briefly, total RNA was extracted from two independent liver samples using Qiazol (Qiagen) and DNase treated with DNA-free DNA Removal Kit (Ambion). Polyadenylated mRNA was enriched, directional double-stranded cDNA was generated, fragmented by sonication, and prepared for sequencing. Each of the two libraries was sequenced on an Illumina GAIIx to generate 75-bp paired-end fragments.

RNA-seq data processing and analysis

Adapter sequences were trimmed off with reaper from the Kraken tool suite [76]. The paired-end RNA-seq reads from each replicate of C57BL/6J, CAST, and M. caroli were mapped to the corresponding species' genomes (see the “ChIP-seq experiments and data analysis” section) using STAR 1.5.2 [77] with default settings. Raw reads mapping to annotated genes were counted using htseq-count [78]. We then used the raw read counts to perform differential expression analyses with DESeq2 1.20.0 [79] with default settings.

To determine the gene expression patterns around instances of 5-way conserved CTCF sites and species-specific CTCF site losses at TAD boundaries (Fig. 7a, d, g), we first identified the closest upstream and downstream gene in each species using the gene annotation from Ensembl version 95 [65] and then calculated the relative gene expression of downstream to upstream gene in each species. We were not interested in the relative expression of the gene pair flanking a CTCF site per se, but in whether this ratio for each CTCF site is consistent between species when the in-between CTCF binding separating them changes. For this reason, we only used CTCF sites that were flanked by 1:1 orthologous genes between the three species. We went on to use DESeq2 [79] in order to compute the log2(fold change) between the downstream and upstream gene—as a measure of the relative expression of genes flanking each CTCF site—in each species and to subsequently compare this log2(fold change) between species. Since DESeq2 is not designed to normalize for gene lengths, and our aim was to generate comparable expression pattern estimations between the species, we also required all the orthologous genes that we used to have a similar length among the three species (0.7 < len_ratio < 1.3, where len_ratio is the length of gene in species A divided by the length of its orthologous gene in species B). Finally, we compared the calculated log2(fold change) values for each gene pair in C57BL/6J with the corresponding value of its orthologous gene pair in CAST (Fig. 7b, e, h) and in M. caroli (Fig. 7c, f, i).


Data access

Identification of deMPs

The general idea of identification of deMP is first to identify the candidate deactivating mutations that disrupt a putative binding site and next to score these candidate deactivating mutations using CAPE, a tool we developed to identify causal regulatory variant in enhancer regions [15]. All three possible mutations at a genomic position, regardless of whether they exist as human SNPs, were scored by CAPE. The mutations with significant CAPE scores were considered to be deactivating mutations. The genomic positions holding at least one deactivating mutation were dubbed deactivating mutation positions (deMPs).

Specifically, we utilized the k-mer vocabularies trained on ChIP-seq enhancers to infer the sequence specificities of TFBSs. The enriched k-mers (k = 8) were assumed to be the potentially functional TFBSs [6] on ChIP-seq enhancers. To identify the enriched k-mers in HepG2 enhancers, we first generated a set of controls for each enhancer sequence. Controls were randomly sampled from the whole genome with the same GC-content, repeat-content, and length as the corresponding enhancer sequence. Five control sequences were extracted for each enhancer. In cases when not enough controls with our strict criteria (ΔGC-content ≤ 0.005, Δrepeat-content ≤ 0.01) could be identified, we created additional controls by reshuffling enhancer sequences. For each of the possible 32,896 k-mers (k = 8), we used the Fisher exact test to evaluate the enrichment of k-mers in the HepG2 enhancer set and identified the top 522 k-mers significantly enriched in enhancers (p ≤ 1e−3 after the Bonferroni correction) as potentially functional k-mers and 30,647 insignificant k-mers (p > 1e−3 without the Bonferroni correction) as background k-mers.

As we did in our previous study [6], we applied a modified intragenomic replicates (IGR) model [38] to recognize candidate deactivating mutations that change a top k-mer to a background k-mer once we identified top k-mers in the positive training set. The candidate deactivating mutations were next scored by CAPE. For the HepG2 cell line, only the mutations with significant CAPE scores (CAPE score ≥ 0.57156, corresponding to FPR ≤ 0.01) were considered to be deMs. We used the change of the associated k-mers to identify the candidate deactivating mutation before applying CAPE due to the limitation of the CAPE score. The output of CAPE is the probability of a mutation being a causal regulatory variation by either decreasing or increasing the enhancer activity. Since we are particularly focused on mutations deactivating enhancers, we need to limit the candidate deactivating mutations to the ones that could disrupt a potential binding site using the k-mer vocabularies.

To identify the deMPs in the left ventricle, we trained CAPE on the human left ventricle eQTLs by integrating the regulatory signals of this tissue (H3K27ac, H3K4me1, H3K4me3, P300, DNase, H3K36me3, H3K27me3, H3K9me3). We then scored all possible single-nucleotide variants (SNVs) in the left ventricle enhancer region. Only the mutations with CAPE score ≥ 0.58276 (FPR ≤ 0.01) were identified to be deMs (Additional file 1: Figure S14). The top 20% of enhancers with the most abundant deMPs correspond to fragile, and the bottom 20% of enhancers devoid of deMPs correspond to stable enhancers, respectively. The top 5% mutations with the highest CAPE scores and 5% random mutations of the two stable enhancers (hs1760 and human ortholog of mm69) are listed in Additional file 2: Tables S6-S7.

Functional enrichment analysis using GREAT

Functional enrichment of enhancers was performed using the online Genomic Regions Enrichment of Annotations Tool (GREAT) version 3.0.0 [21]. In the GREAT figure (Fig. 3a), the default distance parameter was applied for the regulatory domain assignment of genes, and the single nearest gene rule was applied to associate enhancers with genes. The Gene Ontology (GO) biological process terms were included only if they satisfied the strict criteria in at least one category of enhancers: (1) binomial p value ≤ 1e−4, (2) a minimum binomial observed region hits and hypergeometric observed gene hits of 10, and (3) a minimum binomial region and hypergeometric gene set fold enrichment of 2. The −log10 binomial p values were plotted on the y-axis. To show that the GO enrichment of both fragile and stable enhancers is robust by different gene association rules, the other two gene association options (“basal plus extension” and “two nearest genes”) were also applied (Additional file 1). To compensate for the bias caused by assigning all the enhancers to their nearest genes, 45% of enhancers were randomly relocated before applying GREAT for 10 times (Additional file 1).

Enrichment analysis of GWAS traits

The NHGRI GWAS Catalog was downloaded in September 2016 [1]. The GWAS traits that coincided with the single-nucleotide polymorphisms (SNPs) of the three sets of enhancers were first grouped by disease type (Additional file 2: Table S4). To study the enrichment of a set of SNPs coinciding with a certain disease type, the tag SNPs coinciding with the GWAS traits were further expanded by linkage disequilibrium (LD) (r 2 > 0.8, maximum distance of 500 kb). The enrichment of stable enhancer SNPs coinciding with a disease type relative to fragile enhancer SNPs was evaluated as −logP of the hypergeometric distribution, and vice versa.

Identification of potential TFBSs in the three sets of enhancers

For the purpose of identifying the location of potential binding sites, we used the profiles of binding sites for vertebrate TFs stored in Jaspar [39], CIS-BP [40], SwissRegulon [41], HOCOMOCO [42], and UniPROBE [43] databases. We trained an in-house developed tool called tfbsFrag on random sequences to create optimized position-specific scoring matrices (PSSM) identified by FIMO [44] to maintain the rate of false-positive discoveries in a real genomic sequence to about five false positives in 10 kb of sequence. We then used tfbsFrag and the optimized vertebrate PSSMs to scan the enhancer sequences of the three classes. The human reference genome hg19 was hard-masked to eliminate the transposable elements when searching for potential TFBSs. Five random sequences were generated for each enhancer sequence with strict criteria (ΔGC-content ≤ 0.005, Δrepeat-content ≤ 0.01), which were used for PSSM identification and for determining the TFBSs enrichment of an enhancer set relative to the background. The occurrence of a particular TFBS in a set of enhancer/random sequence was normalized by the total length of non-repetitive enhancer/random regions. Then, the enrichment of the TFBSs of TF A (i.e., TFBSA) in a set of enhancers is determined by formula 1.

If an enhancer harbored at least three potential binding sites for a TF expanding no more than 1 kb, we assumed that this enhancer had at least one homotypic TFBS cluster. Analogously, if an enhancer harbored at least three potential binding sites for different TFs expanding no more than 1 kb, we assumed that this enhancer had at least one heterotypic TFBS clusters.

Partition CAPE score

CAPE is a support vector machine-based classifier aimed at predicting causal regulatory variant [15]. In brief, it learns sequence code from large-scale chromatin profiling data of multiple signal tracks, including DNase-seq, H3K27ac, H3K4me1, H3K4me2, H3K4me3, H2A.Z, P300, and major TF binding data of the corresponding tissue. Two sequence signatures, namely, the disruptive effect of the mutation on major TF binding (Δ) and the co-binding of TFs in its neighborhood (S), are the basic component of features for each signal (Fig. 1a). In all, CAPE integrates (Nk × NkmerSignature × NsignalTrack) features. Nk (= 5) is the number of k-mer sizes (k = 4, 6, 8, 10, 12). NkmerSignature (= 2) is the number of signatures including the binding affinity change of the potential binding site due to the mutation (Δ) and the overall binding capabilities of the nearby sequence context of the genetic variant (S). NsignalTrack is the number of the chromatin data (Fig. 1b). The optimal weights for the features learned from the fivefold cross-validation of the eQTL model of HepG2 cell line [15] are listed in Additional file 2: Table S8. The optimal hyperplane of the classifier can therefore be partitioned to two components—the weighted sum of disruptive effect on the cognate motif (denoted as WS(Δ)) and the weighted sum of the co-binding of other TFs in the flanking region (denoted as WS(S)) (formula 2).

where w1kj and w2kj are the optimal weights learned from the training set of the eQTL model.

Mouse transgenic reporter assays

Human enhancer regions (see Additional file 2: Tables S9-S10 for sequences) were PCR amplified from human genomic DNA (wild-type) or chemically synthesized by Integrated DNA Technologies (IDT) (5% top deM and 5% random non-deM mutations) and cloned into an Hsp68-promoter-LacZ reporter vector [46] using Gibson (New England Biolabs [NEB]) cloning [47]. Transgenic mouse embryos were generated by pronuclear injection, and F0 embryos were collected at E11.5 and stained for LacZ activity as previously described [45, 46]. The procedures for generating transgenic and engineered mice were reviewed and approved by the Lawrence Berkeley National Laboratory (LBNL) Animal Welfare and Research Committee.


Numerous proteins in most life forms, but particularly animals and plants, contain compositionally ordered regions, which consist of recurring motifs, such as short tandem repeats, periodic structures and repetitive domains 1,2,3,4,5 . Hereinafter we refer to such recurring motifs simply as repeats. Repeats are crucially important, in particular, as building material for scaffolds of various macromolecular complexes, for example, nuclear pores 6,7 , the proteasome 8 or mechanotransduction channels 9 . Examples of the most abundant repeats with scaffolding functions include ankyrin, tetratricopeptide (TPR) and WD40 repeats 10,11,12,13,14,15 . Repeats are also important in essential biochemical functions such as transcription regulation as exemplified by the extremely common Zn-finger repeats 16,17 .

Repeats can emerge by means of replication slippage and recombination 18,19 , grow into longer units 20 , and diverge by accumulating mutations. New repeats represent a major source of genetic variation, often associated with fast evolution and acquisition of new functions 21,22,23 . Striking examples, from diverse organisms, of the role played by gain and loss of protein repeats in microevolution include variation in the clock gene period, which is responsible for adaptation of the circadian clock to temperature in Drosophila 24 , the Runx-2 gene, associated with morphological changes in dogs 25 , and cell wall proteins, leading to new cell adhesion phenotypes in fungi and protists, and thought to allow for evasion of the host immune system 26 .

Several comparative studies have shown that repetitive regions in proteins are globally conserved across species 27,28,29,30 , indicating that repeats are functional but also that fast evolution is rare 29 . Despite this strong evidence of the functionality and evolutionary conservation of repeats, repeat variation is also a known molecular driver of genetic disease 31,32 , which indicates the importance of rapid change in repetitive regions of proteins. Furthermore, rapid evolution of protein repeats plays key roles in various aspects of immunity as exemplified by the leucine-rich repeats, which are the key structural components of innate immunity proteins, such as animal Toll-like receptors and plant disease resistance proteins, as well as adaptive immunity components in jawless vertebrates 33,34,35,36,37,38 .

Thus, there seems to be a conundrum between the overall evolutionary conservation in repetitive regions of proteins and rapid change of repeats associated with a variety of biological processes. Here we resolve this apparent contradiction by revealing a dramatic difference between the regimes of intra-protein (horizontal) evolution of repeats and inter-protein (vertical) evolution of repeats in orthologous proteins.

To analyse the evolution of repeats and maximize the likelihood that evolutionary rates can be estimated, we develop a rigorous method to extract repeats with conserved length and significant sequence similarity from protein sequences. We validate it and apply it to systematically compare the horizontal and vertical evolution of repeats in diverse groups of organisms. We show that repeats are highly conserved between species, while horizontally propagating and diverging. Thus, each fixed repeat appears to be functionally important in itself and hence subject to purifying selection, whereas in the initial phase of the evolution of repetitive regions, a combination of strongly relaxed purifying selection and positive selection drives fast horizontal divergence of the repeat sequences, presumably yielding new functions. Because variation of repeats plays a crucial role in human disease, in particular neurodegeneration and cancer, the methodology employed here provides means to study somatic horizontal evolution of repeats, and could contribute to the identification of disease drivers associated with this mutational class.

Access options

Get full journal access for 1 year

All prices are NET prices.
VAT will be added later in the checkout.
Tax calculation will be finalised during checkout.

Get time limited or full article access on ReadCube.

All prices are NET prices.


The discovery of the role of DNA in heredity, and observations by Frederick Sanger of variation between animal insulins in 1949, [2] prompted early molecular biologists to study taxonomy from a molecular perspective. [3] [4] Studies in the 1960s used DNA hybridization and protein cross-reactivity techniques to measure similarity between known orthologous proteins, such as hemoglobin [5] and cytochrome c. [6] In 1965, Émile Zuckerkandl and Linus Pauling introduced the concept of the molecular clock, [7] proposing that steady rates of amino acid replacement could be used to estimate the time since two organisms diverged. While initial phylogenies closely matched the fossil record, observations that some genes appeared to evolve at different rates led to the development of theories of molecular evolution. [3] [4] Margaret Dayhoff's 1966 comparison of ferrodoxin sequences showed that natural selection would act to conserve and optimise protein sequences essential to life. [8]

Over many generations, nucleic acid sequences in the genome of an evolutionary lineage can gradually change over time due to random mutations and deletions. [9] [10] Sequences may also recombine or be deleted due to chromosomal rearrangements. Conserved sequences are sequences which persist in the genome despite such forces, and have slower rates of mutation than the background mutation rate. [11]

Conservation can occur in coding and non-coding nucleic acid sequences. Highly conserved DNA sequences are thought to have functional value, although the role for many highly conserved non-coding DNA sequences is poorly understood. [12] [13] The extent to which a sequence is conserved can be affected by varying selection pressures, its robustness to mutation, population size and genetic drift. Many functional sequences are also modular, containing regions which may be subject to independent selection pressures, such as protein domains. [14]

Coding sequence Edit

In coding sequences, the nucleic acid and amino acid sequence may be conserved to different extents, as the degeneracy of the genetic code means that synonymous mutations in a coding sequence do not affect the amino acid sequence of its protein product. [15]

Amino acid sequences can be conserved to maintain the structure or function of a protein or domain. Conserved proteins undergo fewer amino acid replacements, or are more likely to substitute amino acids with similar biochemical properties. [16] Within a sequence, amino acids that are important for folding, structural stability, or that form a binding site may be more highly conserved. [17] [18]

The nucleic acid sequence of a protein coding gene may also be conserved by other selective pressures. The codon usage bias in some organisms may restrict the types of synonymous mutations in a sequence. Nucleic acid sequences that cause secondary structure in the mRNA of a coding gene may be selected against, as some structures may negatively affect translation, or conserved where the mRNA also acts as a functional non-coding RNA. [19] [20]

Non-coding Edit

Non-coding sequences important for gene regulation, such as the binding or recognition sites of ribosomes and transcription factors, may be conserved within a genome. For example, the promoter of a conserved gene or operon may also be conserved. As with proteins, nucleic acids that are important for the structure and function of non-coding RNA (ncRNA) can also be conserved. However, sequence conservation in ncRNAs is generally poor compared to protein-coding sequences, and base pairs that contribute to structure or function are often conserved instead. [21] [22]

Conserved sequences are typically identified by bioinformatics approaches based on sequence alignment. Advances in high-throughput DNA sequencing and protein mass spectrometry has substantially increased the availability of protein sequences and whole genomes for comparison since the early 2000s. [23] [24]

Homology search Edit

Conserved sequences may be identified by homology search, using tools such as BLAST, HMMER, OrthologR, [25] and Infernal. [26] Homology search tools may take an individual nucleic acid or protein sequence as input, or use statistical models generated from multiple sequence alignments of known related sequences. Statistical models such as profile-HMMs, and RNA covariance models which also incorporate structural information, [27] can be helpful when searching for more distantly related sequences. Input sequences are then aligned against a database of sequences from related individuals or other species. The resulting alignments are then scored based on the number of matching amino acids or bases, and the number of gaps or deletions generated by the alignment. Acceptable conservative substitutions may be identified using substitution matrices such as PAM and BLOSUM. Highly scoring alignments are assumed to be from homologous sequences. The conservation of a sequence may then be inferred by detection of highly similar homologs over a broad phylogenetic range. [28]

Multiple sequence alignment Edit

Multiple sequence alignments can be used to visualise conserved sequences. The CLUSTAL format includes a plain-text key to annotate conserved columns of the alignment, denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ) [30] Sequence logos can also show conserved sequence by representing the proportions of characters at each point in the alignment by height. [29]

Genome alignment Edit

Whole genome alignments (WGAs) may also be used to identify highly conserved regions across species. Currently the accuracy and scalability of WGA tools remains limited due to the computational complexity of dealing with rearrangements, repeat regions and the large size of many eukaryotic genomes. [32] However, WGAs of 30 or more closely related bacteria (prokaryotes) are now increasingly feasible. [33] [34]

Scoring systems Edit

Other approaches use measurements of conservation based on statistical tests that attempt to identify sequences which mutate differently to an expected background (neutral) mutation rate.

The GERP (Genomic Evolutionary Rate Profiling) framework scores conservation of genetic sequences across species. This approach estimates the rate of neutral mutation in a set of species from a multiple sequence alignment, and then identifies regions of the sequence that exhibit fewer mutations than expected. These regions are then assigned scores based on the difference between the observed mutation rate and expected background mutation rate. A high GERP score then indicates a highly conserved sequence. [35] [36]

LIST [37] [38] (Local Identity and Shared Taxa) is based on the assumption that variations observed in species closely related to human are more significant when assessing conservation compared to those in distantly related species. Thus, LIST utilizes the local alignment identity around each position to identify relevant sequences in the multiple sequence alignment (MSA) and then it estimates conservation based on the taxonomy distances of these sequences to human. Unlike other tools, LIST ignores the count/frequency of variations in the MSA.

Aminode [39] combines multiple alignments with phylogenetic analysis to analyze changes in homologous proteins and produce a plot that indicates the local rates of evolutionary changes. This approach identifies the Evolutionarily Constrained Regions in a protein, which are segments that are subject to purifying selection and are typically critical for normal protein function.

Other approaches such as PhyloP and PhyloHMM incorporate statistical phylogenetics methods to compare probability distributions of substitution rates, which allows the detection of both conservation and accelerated mutation. First, a background probability distribution is generated of the number of substitutions expected to occur for a column in a multiple sequence alignment, based on a phylogenetic tree. The estimated evolutionary relationships between the species of interest are used to calculate the significance of any substitutions (i.e. a substitution between two closely related species may be less likely to occur than distantly related ones, and therefore more significant). To detect conservation, a probability distribution is calculated for a subset of the multiple sequence alignment, and compared to the background distribution using a statistical test such as a likelihood-ratio test or score test. P-values generated from comparing the two distributions are then used to identify conserved regions. PhyloHMM uses hidden Markov models to generate probability distributions. The PhyloP software package compares probability distributions using a likelihood-ratio test or score test, as well as using a GERP-like scoring system. [40] [41] [42]

Ultra-conserved elements Edit

Ultra-conserved elements or UCEs are sequences that are highly similar or identical across multiple taxonomic groupings. These were first discovered in vertebrates, [43] and have subsequently been identified within widely-differing taxa. [44] While the origin and function of UCEs are poorly understood, [45] they have been used to investigate deep-time divergences in amniotes, [46] insects, [47] and between animals and plants. [48]

Universally conserved genes Edit

The most highly conserved genes are those that can be found in all organisms. These consist mainly of the ncRNAs and proteins required for transcription and translation, which are assumed to have been conserved from the last universal common ancestor of all life. [49]

Genes or gene families that have been found to be universally conserved include GTP-binding elongation factors, Methionine aminopeptidase 2, Serine hydroxymethyltransferase, and ATP transporters. [50] Components of the transcription machinery, such as RNA polymerase and helicases, and of the translation machinery, such as ribosomal RNAs, tRNAs and ribosomal proteins are also universally conserved. [51]

Phylogenetics and taxonomy Edit

Sets of conserved sequences are often used for generating phylogenetic trees, as it can be assumed that organisms with similar sequences are closely related. [52] The choice of sequences may vary depending on the taxonomic scope of the study. For example, the most highly conserved genes such as the 16S RNA and other ribosomal sequences are useful for reconstructing deep phylogenetic relationships and identifying bacterial phyla in metagenomics studies. [53] [54] Sequences that are conserved within a clade but undergo some mutations, such as housekeeping genes, can be used to study species relationships. [55] [56] [57] The internal transcribed spacer (ITS) region, which is required for spacing conserved rRNA genes but undergoes rapid evolution, is commonly used to classify fungi and strains of rapidly evolving bacteria. [58] [59] [60] [61]

Medical research Edit

As highly conserved sequences often have important biological functions, they can be useful a starting point for identifying the cause of genetic diseases. Many congenital metabolic disorders and Lysosomal storage diseases are the result of changes to individual conserved genes, resulting in missing or faulty enzymes that are the underlying cause of the symptoms of the disease. Genetic diseases may be predicted by identifying sequences that are conserved between humans and lab organisms such as mice [62] or fruit flies, [63] and studying the effects of knock-outs of these genes. [64] Genome-wide association studies can also be used to identify variation in conserved sequences associated with disease or health outcomes. In Alzehimer's disease there had been over two dozen novel potential susceptibility loci discovered [65] [66]

Functional annotation Edit

Identifying conserved sequences can be used to discover and predict functional sequences such as genes. [67] Conserved sequences with a known function, such as protein domains, can also be used to predict the function of a sequence. Databases of conserved protein domains such as Pfam and the Conserved Domain Database can be used to annotate functional domains in predicted protein coding genes. [68]