I was looking at the GC content percentages of a few organisms. I also know how to calculate the GC content percentage. What I want to know is what information that number gives us. Suppose, for example, that the human genome is 40% GC.
Does it help us compare the number of genes between different species? For example, if a bacterium has a lower GC content than human, does that mean humans have more genes than bacteria?
Or is it that 40% of the genome consists of coding regions, and more genes are found in those regions? What does the number "40%" actually tell us?
Please help me understand the concept. I would be thankful for your answers and suggested reading.
EDIT: My question is: what does knowing the GC content percentage tell us? If the GC content is 40%, does it correlate with the number of genes in the genome? Regards, Prakki Rama.
Interesting question. The GC-content seems to evolve over time, and the GC-content of coding regions also seems to be higher than that of the surrounding non-coding regions (see reference 1). Whether this higher GC-content serves a specific function is (if I understand this right) debated among the groups doing research in this field. Have a look at the references (and probably also their references) to decide on this:
- Both selective and neutral processes drive GC content evolution in the human genome.
- Integrating genomics, bioinformatics, and classical genetics to study the effects of recombination on genome evolution.
- Recombination Drives the Evolution of GC-Content in the Human Genome
- GC-Content Evolution in Mammalian Genomes: The Biased Gene Conversion Hypothesis
- Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes
A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes
Correlations between genome composition (in terms of GC content) and usage of particular codons and amino acids have been widely reported, but poorly explained. We show here that a simple model of processes acting at the nucleotide level explains codon usage across a large sample of species (311 bacteria, 28 archaea and 257 eukaryotes). The model quantitatively predicts responses (slope and intercept of the regression line on genome GC content) of individual codons and amino acids to genome composition.
Codons respond to genome composition on the basis of their GC content relative to their synonyms (explaining 71-87% of the variance in response among the different codons, depending on measure). Amino-acid responses are determined by the mean GC content of their codons (explaining 71-79% of the variance). Similar trends hold for genes within a genome. Position-dependent selection for error minimization explains why individual bases respond differently to directional mutation pressure.
Our model suggests that GC content drives codon usage (rather than the converse). It unifies a large body of empirical evidence concerning relationships between GC content and amino-acid or codon usage in disparate systems. The relationship between GC content and codon and amino-acid usage is ahistorical; it is replicated independently in the three domains of living organisms, reinforcing the idea that genes and genomes at mutation/selection equilibrium reproduce a unique relationship between nucleic acid and protein composition. Thus, the model may be useful in predicting amino-acid or nucleotide sequences in poorly characterized taxa.
A longstanding open problem in biology is why the G+C (guanine plus cytosine) content of DNA varies so much across taxonomic groups. In theory, the amounts of the four bases in DNA (adenine, guanine, cytosine, and thymine) should be roughly equal, and regression to the mean should drive all organisms to a genomic G+C content of 50%. That's not what we find. In some organisms, like Mycobacterium tuberculosis, the G+C content is 65%, whereas in others, like Clostridium botulinum (the botulism organism) the G+C content is only 28%.
We know that, in general, G+C content correlates (not perfectly, though) with large genome size, in bacteria. Very low G+C content usually means a smaller genome size, and in fact tiny intracellular parasites and symbionts like Buchnera aphidicola (the aphid endosymbiont) have some of the lowest G+C contents of all (at 23%).
It's not hard to understand the presence of low-GC organisms, since it's well known that most transition mutations are GC-to-AT transitions. The high prevalence of mutations in the direction of A+T has often been called "AT drift."
But some organisms go the other way, developing unusually high G+C content in their genomes, indicating that something must be counteracting AT drift in those organisms.
Recently, a group of Chinese scientists (see Wu et al., "On the molecular mechanism of GC content variation among eubacterial genomes," Biology Direct, 2012, 7:2) has advanced the notion that high G+C content is due, specifically, to the presence of the dnaE2 gene, which codes for a low-fidelity DNA repair polymerase. This gene, they say, drives A:T pairs to become G:C pairs during the low-fidelity DNA repair that goes on in certain bacteria in times of stress. Not all bacteria contain the dnaE2 polymerase. Wu et al. discuss their theory in some detail in a January 2014 article in the ISME Journal.
In earlier genomic studies of my own, I curated a list of 1373 eubacterial species (in which no species occurs twice), spanning a wide range of G+C values. When I learned of the dnaE2 hypothesis of Wu et al., I decided to check it against my own curated collection of organisms.
The first thing I did was go to UniProt.org and do a search on dnaE2. Some 1882 hits came back in the search, but many hits were for proteins inferred to be DNA polymerase III alpha subunits, not necessarily of the dnaE2 variety. In order to eliminate false positives, I decided to restrict my search to just bona fide dnaE2 entries that have been reviewed. That immediately cut the number of hits down to 77.
But among the 77 hits, some species were listed more than once (due to entries for multiple strains of the organism). I deduplicated the list at the species level and ended up with 60 unique species.
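The species-level deduplication can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure actually used for the list above: the organism strings are made-up examples in the general style of database organism names, and the helper names `species_key` and `dedupe_by_species` are hypothetical. It assumes the first two words of each name are genus and species, which would mis-handle names such as "Candidatus ..." entries.

```python
def species_key(organism_name: str) -> str:
    """Return 'Genus species' from a full organism name, dropping strain details.

    Assumption: the first two whitespace-separated tokens are genus and species.
    """
    return " ".join(organism_name.split()[:2])


def dedupe_by_species(organism_names):
    """Keep one entry per species, preserving first-seen order."""
    seen = set()
    unique = []
    for name in organism_names:
        key = species_key(name)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Illustrative input only; these are not the actual search hits.
hits = [
    "Mycobacterium tuberculosis (strain A)",
    "Mycobacterium tuberculosis (strain B)",
    "Pseudomonas aeruginosa (strain C)",
]
print(dedupe_by_species(hits))
```

Two strains of the same species collapse to a single entry, which is how 77 reviewed hits can shrink to a shorter species list.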
In this plot, genome A+T content (a taxonomic metric) is on the x-axis and coding-region purine content is on the y-axis. (N=1373) The points in red represent organisms that possess a dnaE2 error-prone polymerase. See text for discussion.
This graph plots A+T content (which of course is just one minus the G+C content) on the horizontal axis, against coding-region purine content (A+G) on the vertical axis. (For more information on the significance of coding-region purine content, see my previous posts here and here. It's not important, though, for the present discussion.) Notice that the red points tend to occur on the left side of the graph, in the area of high G+C (low A+T) content. The red dot furthest to the right represents the genome of Saccharophagus degradans. Only 6 out of 47 dnaE2-positive organisms have G+C content below 50% (A+T above 50%). The rest have genomes rich in G+C.
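For readers who want to compute the two plotted quantities themselves, here is a minimal Python sketch. It assumes the input is a plain DNA string (for purine content, the coding-strand sequence of the coding regions), and it ignores any character other than A, C, G, and T. The function names are illustrative, not taken from any particular library.

```python
from collections import Counter


def base_fractions(seq: str) -> dict:
    """Fraction of each of A, C, G, T among the unambiguous bases in seq."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return {b: counts[b] / total for b in "ACGT"}


def gc_content(seq: str) -> float:
    f = base_fractions(seq)
    return f["G"] + f["C"]


def at_content(seq: str) -> float:
    # A+T is simply one minus G+C.
    return 1.0 - gc_content(seq)


def purine_content(seq: str) -> float:
    # Purines are A and G.
    f = base_fractions(seq)
    return f["A"] + f["G"]


example = "ATGGCGTAA"  # toy sequence, not real data
print(gc_content(example), at_content(example), purine_content(example))
```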
This is, of course, just a quick, informal test (a "sanity check," if you will) of the Wu hypothesis regarding dnaE2 (which is a repair polymerase not needed for normal DNA replication, nor possessed by all bacteria). Various types of sampling errors could invalidate these results. Also, the Wu hypothesis itself is open to criticism on the grounds that correlation does not prove causation. Nevertheless, it's an interesting hypothesis and a random check of 47 dnaE2-positive species in my collection of 1373 organisms tends to provide at least anecdotal verification of the Wu theory that dnaE2 causes drift toward high G+C content.
Of course, Wu's theory does not explain the wide range of G+C contents observed in organisms other than bacteria. (There is no dnaE2 in eukaryotes, for example.) The general notion, however, that genomic G+C content tends to be a reflection of the components of a cell's "repairosome" (the enzyme systems used in repairing DNA) has substantial merit, I think. On that score, be sure to see my earlier analysis of how the presence or absence of an Ogg1 gene influences coding-region purine content.
Here, by the way, are the 47 dnaE2-containing organisms that show up as red dots in the graph above:
Amino acid composition reflects the usage of the twenty standard amino acids in proteins. Understanding the changes of amino acid composition among homologous proteins is key to the investigation of protein functioning, as proteins can acquire new functions through amino acid substitutions (Misawa et al., 2008). Amino acid compositions vary among proteins, even among homologous proteins. Amino acid composition has been reported to correlate with protein structure classes (Bahar et al., 1997; Horner et al., 2008; Du et al., 2014), metabolic efficiency (Akashi and Gojobori, 2002; Kaleta et al., 2013), and translation efficiency (Du et al., 2017). Sueoka (1961, 1962) first reported a correlation between GC content and the amino acid composition of proteins, and it was subsequently widely reported that nucleotide bias causes biased amino acid usage in bacterial and viral genomes (Rooney, 2003; Bohlin et al., 2013; Goswami et al., 2015). Cost minimization could also shape amino acid composition (Seligmann, 2003; Raiford et al., 2008; Bivort et al., 2009). Another factor that influences amino acid gain and loss in protein evolution, and thus causes biased amino acid usage, is the order in which amino acids were recruited into the genetic code (Jordan et al., 2005; Hurst et al., 2006; Mcdonald, 2006; Liu et al., 2015). However, we still do not know how the features of amino acids contribute to shaping their biased compositions in proteins.
Life emerged and has been evolving, and the imprint of evolution is recorded by the genomes (Martin et al., 2016). If the amino acid composition of early life is known, it is possible to infer the factors that cause the biased amino acid usage of proteins during the evolution process. Brooks and Fresco (2002) analyzed the amino acid frequencies in extant proteomes and found that the frequencies of several amino acids increased since the divergence of the last universal common ancestor (LUCA). The LUCA, which could be inferred by comparing the genomes of its descendants, is the most recent ancestor from which all currently living species have evolved. Weiss et al. (2016) traced the LUCA by phylogenetic criteria and identified a set of genes from 355 families, which implies a very specific lifestyle. This work places clostridia and methanogens as the earliest-diverging organisms, which provides us with a very intriguing insight into the LUCA (Mcinerney, 2016). The hydrogenotrophic methanogenic archaeon Methanococcus maripaludis S2 (MmarS2) is a well-studied organism which is anaerobic, H2-dependent and uses the Wood-Ljungdahl pathway (Goyal et al., 2014). Thus, it is possible to choose this organism as one representative of LUCA to investigate the variation of amino acid frequencies.
Because most essential genes are ancient and more evolutionarily conserved (Jordan et al., 2002; Chen et al., 2010), we used essential genes as a representative set of ancient genes and observed the amino acid composition of the corresponding proteins homologous to those of MmarS2. First, we show that in these protein-coding genes GC content has a more significant effect on amino acid deviation than the amino acid recruitment order, for LUCA and non-LUCA proteins. Second, the gain and loss of amino acids in these homologous proteins do not accord well with the amino acid recruitment order. Thus, GC variation has a greater effect on amino acid usage bias than the recruitment order of amino acids. The influence of GC content on amino acid composition may be caused by energy efficiency.
Nonradiometric Evidence for an Old Earth
Dendrochronology is the use of data from annual tree rings to date samples. Tree rings are wider in wetter years and narrower in dryer years. In trees from overlapping periods, patterns of variations in ring thickness through the years can be used to correlate periods represented by rings from different trees, and periods can even be matched in trees from different locations. Using such correlation methods with European trees, dendrochronologists have pieced together a continuous tree-ring sequence 12,410 years long (Friedrich et al., 2004), demonstrating that Earth is much older than 6000 years.
Varves are thin sedimentary layers, typically a few millimeters and often <1 mm thick, that are deposited annually in lakes and certain marine environments, as suspended particles settle slowly to the bottom. Typically, a varve consists of a light-colored layer deposited during spring and summer and a darker layer deposited during autumn and winter. The difference in colors is due to a higher accumulation of the shells of microscopic organisms during the spring and summer months, when these organisms are more abundant (Goslar et al., 1995; Thunell et al., 1995; Kitagawa & van der Plicht, 1998). Series of chemical differences in varves from one year to the next can be matched in sediments from different lakes. This allows correlation between varve sequences of different ancient lakes. Using this method, a continuous sequence of varves representing 13,000 years has been constructed from sediments of ancient Swedish lakes (Wohlfarth et al., 1995).
In some cases, sediment from a single lake yields thousands of varves. A continuous series of 9662 varves is known from the sediment at the bottom of Lake Gościąż in Poland (Goslar et al., 1995). A series of >12,000 varves is known from North America’s Lake Erie (Sears, 1948). A continuous series of 29,100 varves comes from Lake Suigetsu in Japan (Kitagawa & van der Plicht, 1998). These lakes have therefore been accumulating sediment for at least 9662, 12,000, and 29,100 years, respectively, which can only have happened if the Earth is at least that old.
The varve series mentioned above are from sediments that represent only the Holocene and Pleistocene epochs of the Neogene Period (see Figure 1). Varves in sedimentary rock from some earlier periods record even longer stretches of time. The Green River Formation, a lake deposit in Wyoming, Colorado, and Utah from the Eocene Epoch, contains too many varves to count. Using average varve thicknesses in various beds of this formation and the total thickness of those beds, Bradley (1929) calculated that 6.5 million varves are present, which indicates that the lake accumulated sediment for 6.5 million years. Using the same method, Stamp (1925) calculated that 2.17 million varves are present in an Oligocene–Miocene deposit in Myanmar, and Rubey (1930) calculated that 2 million varves are present in the Upper Cretaceous Graneros Shale, a marine deposit from the American Midwest. Varves therefore provide evidence that Earth is millions of years old.
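The thickness-based estimate described above amounts to dividing each bed's total thickness by its average varve thickness and summing over the beds. The Python sketch below shows only that arithmetic. The bed measurements are invented round numbers for illustration (they are not Bradley's, Stamp's, or Rubey's data), and the function name `estimate_varve_count` is hypothetical.

```python
def estimate_varve_count(beds):
    """Estimate the number of varves in a formation.

    beds: iterable of (total_thickness_m, mean_varve_thickness_mm) pairs,
    one pair per bed. Each varve is assumed to represent one year.
    """
    total = 0.0
    for thickness_m, varve_mm in beds:
        if varve_mm <= 0:
            raise ValueError("mean varve thickness must be positive")
        # Convert bed thickness to millimeters before dividing.
        total += (thickness_m * 1000.0) / varve_mm
    return total


# Hypothetical beds: 100 m of 0.2 mm varves and 50 m of 0.1 mm varves.
beds = [(100.0, 0.2), (50.0, 0.1)]
print(round(estimate_varve_count(beds)))
```

With these made-up inputs each bed contributes about 500,000 varves, so the estimate is about one million years of deposition.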
A typical young-Earth creationist objection to varve evidence for long expanses of time is that a large number of thin layers can represent a short time span; for example, ash layers produced during a single day by the eruption of Mount St. Helens contain many fine laminations (Whitmore, 2008). However, this objection is nonsensical, because volcanic ash laminations are not varves. Volcanic ash lacks the shells of aquatic microorganisms that color the summer layer of a varve, and experiments using sediment traps demonstrate that a single varve takes a year to accumulate (Thunell et al., 1995).
Polar ice has annual growth layers that are visually identifiable. In an ice core from Greenland, 40,500 such layers were visually counted (Alley et al., 1993), showing that the area has been accumulating ice – and has therefore existed – for more than 40,000 years. Alley et al. (1993) confirmed that the layers were indeed annual by counting not only the visible boundaries of the layers but also the variations in dust accumulation and chemical properties of Arctic ice that are known to vary annually. Other Arctic ice cores record time spans of
Polar ice cores show patterns of changes in chemical signatures, dust accumulation, and pollen accumulation that vary across centuries and can be matched from one ice core to the next. Such changes can also be matched with corresponding changes across tree rings in the dendrochronological record and across varves in the lake sediment record. The ice record, the dendrochronological record, and the lake sediment record all record the same number of years between given climatic events. Each dating method therefore confirms the accuracy of the other. For example, all three methods confirm that average temperatures rose dramatically in Europe about 11,450–11,390 years ago (Björck et al., 1996). That the time estimates produced by such methods are correct is confirmed by the presence, in ice layers from the expected periods, of fallout from volcanic eruptions of known times (Johnsen et al., 1992).
An ice core from Lake Vostok, Antarctica, records a much longer span of time than 40,000 years. The Vostok ice core is 3623 m deep, approximately half again the depth of the 40,000-year Greenland ice cores, which are 2300 m deep (Johnsen et al., 1992; Alley et al., 1993). Annual ice layers at great depths are compressed into smaller thicknesses by the weight of the overlying layers. Using known values for the magnitude of such compression, the time span recorded by the Vostok ice core is estimated as
The overall nucleotide composition of an organism’s genome varies greatly between species. Previous work has identified certain environmental factors (e.g., oxygen availability) associated with the relative number of GC bases as opposed to AT bases in the genomes of species. Many of these environments that are associated with high GC content are also associated with relatively high rates of DNA damage. We show that organisms possessing the non-homologous end-joining DNA repair pathway, which is one mechanism to repair DNA double-strand breaks, have an elevated GC content relative to expectation. We also show that certain sites on the genome that are particularly susceptible to double strand breaks have an elevated GC content. This leads us to suggest that an important underlying driver of variability in nucleotide composition across environments is the rate of DNA damage (specifically double-strand breaks) to which an organism living in each environment is exposed.
Citation: Weissman JL, Fagan WF, Johnson PLF (2019) Linking high GC content to the repair of double strand breaks in prokaryotic genomes. PLoS Genet 15(11): e1008493. https://doi.org/10.1371/journal.pgen.1008493
Editor: Xavier Didelot, University of Warwick, UNITED KINGDOM
Received: August 16, 2019 Accepted: October 25, 2019 Published: November 8, 2019
Copyright: © 2019 Weissman et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data used came from public repositories. Completely sequenced prokaryotic genomes were from NCBI’s non-redundant RefSeq database (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/). Relationships between prokaryotes were from the SILVA Living Tree (https://www.arb-silva.de/projects/living-tree/). Clusters of related genomes were from the Alignable Tight Genomic Cluster (ATGC) database (http://dmk-brain.ecn.uiowa.edu/ATGC/). Prokaryotic trait data were from the ProTraits database (http://protraits.irb.hr/). Linkages between genomes and restriction enzymes were from the REBASE database (http://rebase.neb.com/rebase/rebase.html). Intermediate data files and code may be found at: https://github.com/jlw-ecoevo/gcku.
Funding: JLW was supported by a GAANN Fellowship from the U.S. Department of Education and the University of Maryland as well as a COMBINE Fellowship from the University of Maryland and funded by NSF DGE-1632976. WFF was partially supported by the U.S. Army Research Laboratory and the U.S. Army Research Office under Grant W911NF-14-1-0490. PLFJ was supported in part by NIH R00 GM104158. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Mastering Biology: Chapter 1
They recycle within the ecosystem, being constantly reused.
They exit the ecosystem in the form of heat.
They flow through the system, losing some nutrients in the process.
An ecosystem displays complex properties not present in the individual communities within it.
An understanding of the interactions between different components within a living system is a key goal of a systems biology approach to understanding biological complexity.
Understanding the chemical structure of DNA reveals how it directs the functioning of a living cell.
All organisms, including prokaryotes and eukaryotes, use essentially the same genetic code.
The forelimbs of all mammals have the same basic structure, modified for different environments.
The structure of DNA is the same in all organisms.
The scientific method is a procedure used to search for explanations of nature. The scientific method consists of making observations, formulating hypotheses, designing and carrying out experiments, and repeating this cycle.
Observations can be either quantitative or qualitative. Quantitative observations are measurements consisting of both numbers and units, such as the observation that ice melts at 0 °C. In contrast, qualitative observations are observations that do not rely on numbers or units, such as the observation that water is clear.
A hypothesis is a tentative explanation of the observations. The hypothesis is not necessarily correct, but it puts the scientist's understanding of the observations into a form that can be tested through experimentation.
Experiments are then performed to test the validity of the hypothesis. Experiments are observations preferably made under conditions in which the variable of interest is clearly distinguishable from any others.
If the experiment shows that the hypothesis is incorrect, the hypothesis can be modified, and further experiments can be carried out to test the modified hypothesis. This cycle is repeated, continually refining the hypothesis.
If a large set of observations follow a reproducible pattern, this pattern can be summarized in a law—a verbal or mathematical generalization of a phenomenon. For example, over the years people observed that every morning the sun rises in the east, and every night the sun sets in the west. These observations can be described in a law stating, "The sun always rises in the east and sets in the west."
1. The GC content of DNA varies by species, and it varies a lot.
2. Evolution doesn't seem to trend toward an "optimum GC:AT ratio" of any kind.
If there were such a thing as an optimum GC:AT ratio for DNA, surely microorganisms would've figured it out by now. Instead, we find huge diversity: there are bacteria at every point on the GC% spectrum, running from 16% GC for the DNA of Candidatus Carsonella ruddii (a symbiont of the jumping plant louse) to 75% for Anaeromyxobacter dehalogenans 2CP-C (a soil bacterium). At each end of the spectrum you find aerobes and anaerobes, extremophiles and blandophiles, pathogens and non-pathogens. About the only generalization you can make is that the smaller an organism's genome is, the more likely it is to be rich in A+T (low GC%).
Genome size correlates loosely with GC content. The very smallest bacteria tend to have AT-rich (low GC%) DNA.
Some subtle clues tell us that this is not just random deviation from the mean. First, suppose we agree for sake of argument that lateral gene transfer (LGT) is common in the microbial world (a point of view I happen to agree with). Over the course of millions of years, with pieces of DNA of all kinds (high GC%, low GC%) flying back and forth, LGT should force a regression to the mean: It should make genomes tend toward a 50-50 GC:AT ratio. That clearly hasn't happened.
And then there are ordinary mutational pressures. It's beginning to be fairly well accepted (see Hershberg and Petrov, "Evidence That Mutation is Universally Biased Toward AT in Bacteria," PLoS Genetics, 2010, 6:9, e1001115) that natural mutation is strongly biased in the direction of AT by virtue of the fact that deamination of cytosine and methylcytosine (which occurs spontaneously at high frequency) leads to replacement of 'C' with 'T', hence GC pairs becoming AT pairs. The strong natural mutational bias toward AT says that all DNA should creep in the direction of low GC% and end up well below 50% GC. But again, this is not what we see. We see that high-GC organisms like Anaeromyxobacter (and many others) maintain their DNA's unusually high (75%) GC content across millions of generations. Even middle-of-the-road organisms like E. coli (with 50% GC content) don't slowly slip in the direction of high-AT/low-GC.
Clearly, something funny is going on. For a super-high-GC organism like Anaeromyxobacter to maintain its DNA's super-high GC content against the constant tug of mutations in the AT direction, it must be putting significant energy into maintaining that high GC percentage. But why? Why pay extra to maintain a high GC%? And how does the cost get paid?
I think I've come up with a possible answer. It has to do with DNA replication cost, where "cost" is figured in terms of time needed to synthesize a new copy of the DNA (for cell division). Anything that favors low replication cost (high replication speed) should favor survival; that's my main assumption.
My other assumption is that DNA polymerases (the enzymes involved in replication) are not clairvoyant. They can't know, until the need arises, which of the four deoxyribonucleotide triphosphates (dATP, dTTP, dGTP, dCTP) will be needed at a given moment, to elongate the new strand of DNA. When the need arises for (let's say) an 'A', the 'A' (in the form of dATP) has to come from an existing endogenous pool of dNTPs containing all four bases (dATP, dTTP, dGTP, dCTP) in whatever concentrations they're in. The enzyme has to wait until a dATP (if that's what's needed) randomly happens to lock into the active site. Odds are only one in four (assuming equal concentrations of dNTPs) of a dATP coming along at exactly the right moment. Odds are 3 out of 4 that some incorrect dNTP (either dGTP, dTTP, or dCTP) will try, and fail, to fit the active site first, before dATP comes along.
But imagine that your DNA is 75% G+C. And suppose you've regulated your intracellular metabolism to maintain dGTP and dCTP in a 3:1 ratio over dATP and dTTP. The odds of a good random "first hit" go up.
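To put a number on "go up," here is a small Python sketch of the first-hit probability. It assumes (my simplification, for illustration) that G and C are equally abundant, that A and T are equally abundant, and that each fetch is an independent random draw from the pool. The function names are hypothetical.

```python
def composition(gc: float) -> dict:
    """Base frequencies for a given GC fraction, assuming G = C and A = T."""
    return {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}


def first_hit_probability(template_gc: float, pool_gc: float) -> float:
    """Chance that the first randomly fetched dNTP is the one the template needs.

    Sum over bases of P(template needs base) * P(pool supplies base).
    """
    template = composition(template_gc)
    pool = composition(pool_gc)
    return sum(template[b] * pool[b] for b in "ACGT")


# 75%-GC DNA with an evenly mixed pool, then with a pool that is also 75% GC
# (dGTP and dCTP at 3:1 over dATP and dTTP).
print(first_hit_probability(0.75, 0.50))
print(first_hit_probability(0.75, 0.75))
```

Under these assumptions the first-hit odds rise from 1 in 4 with an evenly mixed pool to 0.3125 (5 in 16) when the pool matches the 75%-GC template.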
The way the software works is this. Read a base off the template. Fetch a base randomly from the base pool. If the base happens to be the one (out of four) that's called for, score '1' for the timing parameter, and continue to read another base off the template. If the base was not the one that's called for, put it back in the pool array in a random location, then randomly fetch another base from the pool and increment the timing parameter. (For each fetch, the timing parameter goes up by 1.) Keep fetching (and throwing back bases) until the proper base comes up, incrementing the time parameter as appropriate. (The time parameter keeps track of the number of fetch attempts.) When the correct base turns up, the pool shrinks by one base. In other words, replication consumes the pool, but as I said earlier, the pool contains ten times as many bases (to start) as the DNA template. So the pool ends up 10% smaller at the end of replication.
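For anyone who wants to try this, below is a minimal Python re-creation of the procedure just described. It is a sketch, not the original software. It keeps the pool as four counts rather than an array, which gives the same statistics as fetching from (and throwing back into) a shuffled array, because a uniformly random fetch only depends on how many of each base remain. The assumptions that G = C and A = T in both template and pool, and all function names, are mine.

```python
import random

BASES = "ACGT"


def make_counts(n: int, gc: float) -> dict:
    """Split n bases into A/C/G/T counts for a given GC fraction (G = C, A = T)."""
    n_gc = round(n * gc)
    g = n_gc // 2
    a = (n - n_gc) // 2
    return {"G": g, "C": n_gc - g, "A": a, "T": n - n_gc - a}


def make_template(length: int, gc: float, rng: random.Random) -> list:
    """Random template with the requested composition."""
    counts = make_counts(length, gc)
    seq = [b for b in BASES for _ in range(counts[b])]
    rng.shuffle(seq)
    return seq


def replicate(template, pool_counts, rng: random.Random) -> float:
    """Copy the template once; return the average number of fetches per base."""
    pool = dict(pool_counts)
    fetches = 0
    for needed in template:
        if pool[needed] == 0:
            raise RuntimeError("pool ran out of " + needed)
        while True:
            fetches += 1  # the timing parameter: one tick per fetch attempt
            drawn = rng.choices(BASES, weights=[pool[b] for b in BASES])[0]
            if drawn == needed:
                pool[drawn] -= 1  # a correct base is consumed
                break
            # A wrong base goes back into the pool, so the counts are unchanged.
    return fetches / len(template)


def mean_fetches(template_gc, pool_gc, runs=100, length=1000,
                 pool_size=10_000, seed=0) -> float:
    """Average fetches per base over several complete replication runs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        template = make_template(length, template_gc, rng)
        total += replicate(template, make_counts(pool_size, pool_gc), rng)
    return total / runs


if __name__ == "__main__":
    for pool_gc in (0.25, 0.50, 0.65, 0.75):
        print(pool_gc, round(mean_fetches(0.75, pool_gc, runs=20), 2))
```

The defaults mirror the setup described here (a pool ten times the size of the template); the seed is fixed only so that repeated runs give the same numbers.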
Each point on this graph represents the average of 100 Monte Carlo runs, each run representing complete replication of a 1000-bp DNA template, drawing from a pool of 10,000 bases. The blue points are runs that used a DNA template containing 25% G+C content. The red points are runs that used DNA with 75% G+C. The X-axis represents different base-pool compositions. See text for details.
I ran Monte Carlo simulations for DNA templates having GC contents of 75%, 50%, and 25%, using base pools set up to have anywhere from 15% GC to 85% (in 2.5% increments). The results for the 75% GC and 25% GC templates (representing high- and low-GC organisms) are shown in the above graph. Each point on the graph represents the average of 100 complete replication runs. The Y-axis shows the average number of fetches per DNA base (so a low value means fast replication; a high value means slower DNA replication). The X-axis shows the percentage of GC in the base-pool, in recognition of the fact that relative dNTP abundances in an organism may vary, in accordance with environmental constraints as well as with organism-specific homeostatic setpoints.
Maximal replication speed (the low point of each curve) happens at a base-pool GC percentage that is displaced in the direction of the DNA's own GC%. So, for the 25%-GC organism (blue data points), max replication efficiency comes when the base-pool is about 33% GC. For the 75% GC organism (red points) the sweet spot is at a base-pool GC concentration of 65%. (Why this is not exactly symmetrical with the other curve, I don't know, but bear in mind, these are Monte Carlo runs. Some variation is to be expected.)
The interesting thing to note is that max replication efficiency, for each organism, comes at 3.73 fetches per base-pair (Y-axis). Cache that thought. It'll be important in a minute.
The real jaw-dropper is what happens when you plot a curve for template DNA with 50% GC content. In the graph below, I've shown the 50%-GC runs as black points. (The red and blue points are exactly as before.)
This is the same graph as before, but with replication data for a 50%-GC genome (black points). Again, each data point represents the average of 100 Monte Carlo runs. Notice that the black curve bottoms out at a higher level (4.0) than the red or blue curves (3.73). This means replication is less efficient for the 50%-GC genome.
Notice that the best replication efficiency comes in the middle of the graph (no big surprise), but check the Y-value: 4.00. The very fastest DNA replication, when the DNA template is 50% GC, requires 4 fetches per base, compared to best-case base-fetching efficiency of 3.73 for the 25%-GC and 75%-GC DNAs. What does this mean? It means DNA replication, in a best-case scenario, is 6.75% more efficient for the skewed-GC organisms. (The difference between 3.73 and 4.00 is 6.75%.)
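These best-case numbers can be cross-checked with a short calculation. If pool depletion is ignored (an approximation: each fetch is then an independent draw), the wait for a base that makes up a fraction p of the pool averages 1/p fetches, so the expected fetches per template base is the sum over the four bases of (template fraction) / (pool fraction). Minimizing that sum gives pool fractions proportional to the square roots of the template fractions. The Python sketch below encodes this reasoning; it again assumes G = C and A = T, and the function names are hypothetical.

```python
import math


def composition(gc: float) -> dict:
    """Base frequencies for a given GC fraction, assuming G = C and A = T."""
    return {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}


def expected_fetches(template_gc: float, pool_gc: float) -> float:
    """Mean fetches per base if every fetch is an independent draw."""
    template = composition(template_gc)
    pool = composition(pool_gc)
    return sum(template[b] / pool[b] for b in "ACGT")


def best_pool_gc(template_gc: float) -> float:
    """Pool GC fraction that minimizes expected_fetches for this template.

    The minimum occurs when each pool fraction is proportional to the
    square root of the matching template fraction.
    """
    roots = {b: math.sqrt(f) for b, f in composition(template_gc).items()}
    return (roots["G"] + roots["C"]) / sum(roots.values())


for template_gc in (0.25, 0.50, 0.75):
    pool_gc = best_pool_gc(template_gc)
    print(template_gc, round(pool_gc, 3),
          round(expected_fetches(template_gc, pool_gc), 3))
```

Under this approximation the optimum pool for 75%-GC DNA is about 63.4% GC with about 3.73 fetches per base, the optimum for 25%-GC DNA is the mirror image at about 36.6% GC, and the 50%-GC template bottoms out at exactly 4. That agrees with the simulated minima, and it suggests the small asymmetry between the red and blue curves comes from sampling noise and the 2.5% grid rather than from anything systematic.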
This goes a long way toward explaining why GC extremism is stable in organisms that pursue it. There is replication efficiency to be had in keeping your DNA biased toward high or low GC. (It doesn't seem to matter which.)
Consider the dynamics of an ATP drawdown. The energy economy of a cell revolves around ATP, which is both an energy molecule and a source for the adenine that goes into DNA and RNA. One would expect normal endogenous concentrations of ATP to be high relative to other NTPs. For a low-GC% organism, that's also a near-ideal situation for DNA replication, because high AT in the base pool puts you near the max-replication-speed part of the curve (see blue points). A sudden drawdown in ATP (when the cell is in crisis) shifts replication speed to the right-hand part of the blue curve, slowing replication significantly. This is what you want if you're an intracellular symbiont (or a mitochondrion, incidentally). You want to stop dividing when the host cell is unable to divide because of an energy crisis.
Consider the high-GC organism (red dots), on the other hand. If ATP levels are high during normal metabolism, replication is not as efficient as it could be, but so what? It just means you're willing to tolerate less-efficient replication in good times. But as ATP draws down (perhaps because nutrients are becoming scarce), DNA replication actually becomes more efficient. This is what you want if you're a free-living organism in the wild. You want to be able to continue replicating your DNA even as ATP becomes scarce. And indeed that's what happens (according to the red data points): As the base pool becomes more GC-rich, replication efficiency increases. The best efficiency comes when base-pool A+T is down around 35%.
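The drawdown argument can be sketched the same way. This is an illustration under my own assumption (independent random draws from an undepleted pool, A = T and G = C), not the Monte Carlo code behind the graph, and the function names are hypothetical. It sweeps the pool's A+T fraction and reports where each template replicates with the fewest fetches.

```python
def fetches_per_base(template_gc, pool_at):
    """Expected draws per base when the pool holds pool_at A+T (A = T, G = C assumed)."""
    template_at = 1.0 - template_gc
    pool_gc = 1.0 - pool_at
    # Each A or T position needs a base whose pool frequency is pool_at / 2,
    # so it waits 2 / pool_at draws on average; G and C positions likewise.
    return template_at * 2 / pool_at + template_gc * 2 / pool_gc

def best_pool_at(template_gc, step=0.01):
    """Pool A+T fraction (on a grid) that minimises the expected fetches."""
    grid = [i * step for i in range(1, int(round(1 / step)))]
    return min(grid, key=lambda at: fetches_per_base(template_gc, at))

for label, gc in (("low-GC (25%) template", 0.25), ("high-GC (75%) template", 0.75)):
    at = best_pool_at(gc)
    print(f"{label}: best pool A+T = {at:.2f}, "
          f"fetches per base = {fetches_per_base(gc, at):.2f}")

# An ATP drawdown modelled as pool A+T falling from 0.60 to 0.40:
for gc in (0.25, 0.75):
    before = fetches_per_base(gc, 0.60)
    after = fetches_per_base(gc, 0.40)
    print(f"template GC {gc:.0%}: {before:.2f} -> {after:.2f} fetches per base")
```

In this sketch the high-GC template does best with pool A+T in the mid-30s percent, in line with the red curve, and the same drop in A+T that speeds up the high-GC template slows the low-GC one.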
I think these simulations are meaningful and I think they help explain the DNA-composition extremism seen among microorganisms. If you're a professional scientist and you find these results tantalizing, and you'd like to co-author a paper for PLoS Genetics (or another journal), please get in touch. (My Google mail is kas-dot-e-dot-thomas.) I'd like to coauthor with someone who is good with statistics, who can contribute more ideas to this line of investigation. I think these results are worth sharing with the scientific community at large.
Significances of incorporating GC and purine contents into models
Empirical relationships between GC content and codon (amino acid) usage have been widely reported but, in most cases, explained less comprehensively. Here we show that each codon as well as each nucleotide in cellular genomes follows a very similar trend when GC content varies (Figures 2, 3, 4 and Additional files 2, 3, 4), albeit with lesser differences between prokaryotes and eukaryotes due to their sequence heterogeneity (for example, isochores in vertebrates [42, 43], integral membrane proteins with hydrophobic nature, horizontal transfer of DNA, and questionable predicted coding regions). Our results strongly suggest that mutation and selection not only act at different levels but also exhibit different priorities that are attributable to the organization of the genetic code [44, 45]. At the nucleotide level, we observe that the compositions of all species for a given GC content are very similar and more or less predictable. Consequently, GC content becomes a significant predictor for nucleotide, codon, and amino acid compositions, since half of the amino acids are rather GC content-sensitive in their first and second codon positions [44–47]. However, this does not mean that GC content, varying from 17% to 75%, is the sole determinant of compositions at all levels [31, 48]: purines have been widely reported to have a determinative role in amino acid physicochemical properties, and purines in the second codon position may control the charge and hydrophobicity of amino acids [44, 46, 49, 50]. Similar to GC content, purine content also differs from one species to another, albeit over a relatively smaller range, deviating by nearly 10% below or above the half line. In bacteria, for instance, the minimum purine content is 48.0% for Clavibacter michiganensis subsp. michiganensis NCPPB 382, whereas the maximum is 58.8% for Clostridium tetani E88.
The slight deviation of purine content, indicating a complex interplay of mutation and selection and reflecting an important balance between the purine-sensitive and insensitive amino acids--15 and 5 (as signified by their codons' sensitivity to purine variations at the third codon position), respectively --can give rise to completely different compositions at the levels of both codons and amino acids (as indicated in Equations 1-8).
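As a concrete illustration of the two quantities discussed here, the following sketch computes GC content (G+C) and purine content (A+G) separately at each of the three codon positions of a coding sequence. It is not the authors' code; the function name and the toy sequence are hypothetical, and the sequence is assumed to be in frame and to contain only A, C, G and T.

```python
def composition_by_codon_position(cds):
    """Return [(gc_fraction, purine_fraction), ...] for codon positions 1, 2 and 3."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    if usable == 0:
        raise ValueError("sequence must contain at least one full codon")
    result = []
    for pos in range(3):
        bases = cds[pos:usable:3]  # every third base, starting at this position
        gc = sum(b in "GC" for b in bases) / len(bases)
        purine = sum(b in "AG" for b in bases) / len(bases)
        result.append((gc, purine))
    return result

# Toy example: codons ATG GCC AAG TAA
for position, (gc, purine) in enumerate(composition_by_codon_position("ATGGCCAAGTAA"), start=1):
    print(f"codon position {position}: GC = {gc:.2f}, purine = {purine:.2f}")
```

For the toy sequence, position 1 holds A, G, A, T (GC 0.25, purine 0.75), position 2 holds T, C, A, A (GC 0.25, purine 0.50), and position 3 holds G, C, G, A (GC 0.75, purine 0.75).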
Therefore, our models first adopt GC and purine contents as two important compositional elements and consider heterogeneous mutation and selection forces acting at all three codon positions. As tested across a wide variety of species, the models provided consistent compositions, quantitatively recapturing the empirical relationships with changing GC and purine contents. Our results, especially the various changing trends (most of which are not linear), further validated that mutations (dominated by GC content variations) and selections (dominated by purine content variations) mainly act at the level of nucleotides rather than codons or amino acids, in accordance with previous studies [12, 41, 52]. Although our models are designed to work on protein-coding sequences, they might also be applicable to nucleotide frequencies in non-coding sequences as an alternative. Second, the deviations from the dominant trends for certain amino acids and, to a lesser extent, some of their codons (for example, it is well accepted that purine-rich sequences often serve as elements of exonic enhancers among animal genes that have multiple spliceosomal introns) reflect selection forces acting primarily on certain amino acids of the proteomes when their amino acid sequence changes interfere with protein-level functions. Third, there are other balancing forces buried in the organization of the genetic code. One of the sets includes the six-fold codons for Leu, Arg (arginine), and Ser (serine). All of them provide diverse balances for purine content variations, as they are all divided between the purine-sensitive and insensitive codons [44, 46]. Although four of the codons for Arg are in the GC-rich quarter of the genetic code, its counterpart, Lys (lysine), has all its codons in the AT-rich quarter in order to maintain enough basic amino acids in the proteomes.
Our models have several variants. Since the models are built on the basis of GC and purine contents, their variants can also be represented by S and R. As assumed, S and R are an independent pair, which implies that S^c and R, S and R^c, and S^c and R^c are also independent pairs (see Models). Therefore, the variants are in essence equivalent to our models.
Implications of composition deviations
The expected compositions predicted by our models, however, sometimes deviate in various degrees from the observed. Such deviations can be caused by complex evolutionary mechanisms (e.g., extreme dinucleotide abundance) and deciphered in terms of mutation and selection [54, 55]: mutation towards a particular nucleotide content (e.g., GC content) primarily determines codon and amino acid usage according to the genetic code structure, and selection essentially caters for a given amino acid usage. Therefore, it is likely that these composition deviations provide implications for molecular evolution.
Considering nucleotide compositions at all three codon positions (Figure 2 and Additional file 2), the four nucleotides at the first and third codon positions deviated evenly, suggesting stronger mutation effects. On the contrary, the four nucleotides at the second codon position deviated remarkably, in a similar manner in all species. As compared to the expected compositions, A and C appear overestimated, whereas G and T are underestimated (Figure 2 and Additional file 2). This indicates strong selection acting at the second position that is intrinsic to the organization of the genetic code: amino acids that have A or C at their second codon positions are more diverged and less flexible toward nucleotide changes across codon positions than within codon positions. Conversely, the amino acids that have G or T at their second codon positions are relatively relaxed toward nucleotide changes across codon positions. Most noticeable are Leu and Arg, whose codons are partitioned within the same position but between the purine-sensitive and insensitive halves (Additional file 5). Our results are in agreement with previous observations [41, 44, 58].
Since selection forces largely act at the levels of amino acids and their codons, we are able to assess the degrees of selection in different organisms by calculating subtle differences among amino acid (codon) conversion matrices. For instance, Ala and Val (valine) are the two amino acids that depart most from expectation in all the collected sequences: in comparison to expectations, there is a surplus of alanine and a deficit of valine. Since amino acids are exchanged at different frequencies due to their compositional relevance at the nucleotide level, it is possible that the deviations of these two amino acids are highly related to such exchangeability. Therefore, we constructed five amino acid exchange matrices based on five different datasets in Escherichia coli, fruit fly, rice, yeast, and mammal (see Methods). When we take the top 10 highly exchangeable pairs in all five matrices, the four among the top are (1) Ala ↔ Ser, (2) Ala ↔ Thr (threonine), (3) Ala ↔ Val, and (4) Val ↔ Ile (isoleucine) (Additional file 6). As we know, amino acids with similar physicochemical properties tend to be more exchangeable [59–62]. It appears that Ala is the most active amino acid, primarily because several of its neighboring amino acids have similar physicochemical properties (such as their size parameters). With regard to the exchange between Val and Ile, it is their similarity in hydrophobicity that plays a key role. These results are by and large consistent with findings in several previous studies [12, 63, 64]. Therefore, our models bear significance in establishing a theoretical framework for compositional analysis and providing clues for molecular evolution studies.
Additional file 1: Table S1. List of species. Table S2. Phyla representation. Table S3. Genomic and environmental properties. Figure S1. Correlations of traits with ΔLFE are not present in its individual components. Figure S2. The ΔLFE profile is more conserved than other genomic traits. Figure S3. Local CUB vs. Local ΔLFE. Figure S4. Comparison between ΔLFE calculated using CDS-wide and position-specific (“vertical”) randomizations. Figure S5. ΔLFE is stronger in highly expressed genes and genes encoding for highly abundant proteins. Figure S6. Unsupervised discovery of profile regions. Figure S7. ΔLFE profiles for all species. Figure S8. Comparison between ΔLFE profiles in different domains. Figure S9. Autocorrelation between ΔLFE profile regions. Figure S10. Trait correlations in taxonomic subgroups. Figure S11. Correlation of ΔLFE with different genomic measures of CUB is consistent. Figure S12. ENc’ correlates with ΔLFE magnitude, not shape. Figure S13. Genomic-GC and genomic-ENc’ both predict ΔLFE. Figure S14. Endosymbionts have weaker ΔLFE. Figure S15. Range robustness for GLS regressions between ΔLFE and related traits. Figure S16. Additional controls for phenomenon related to translation initiation. Figure S17. Dependence of ΔLFE profiles on temperature.
Additional file 2. Species ΔLFE profiles and additional data used for GLS regression analysis.
Additional file 3. Processed ultrametric phylogenetic tree used for GLS regression analysis.