There are lots of different codon optimization approaches out there, and a lot of different qualitative reasons to want optimization (e.g., organism codon bias, GC content, avoiding secondary structure).
One may also want to avoid optimization, however, in order to increase reusability of a component, comparability with prior results, or opportunities for error.
I'd like to know how much I'm likely giving up if I decide not to codon optimize. For example, in my typical work giving up 10% is likely to be lost in the noise, but giving up 2x is worth thinking carefully about, and giving up 10x means it's a must-have.
I know "it depends", and there's a lot of qualitative information out there, but is there any way to get a good rough-order-of-magnitude estimate of the typical expression boost from codon optimization?
I think that there is a problem with how well current codon optimization algorithms actually "optimize" protein expression. I am not aware of any algorithms that work beyond a heuristic level and so for an arbitrary protein, one cannot predict the expression boost that optimization will provide - sometimes yield is significantly decreased. This review covers some of the complexities of optimization really nicely https://doi.org/10.1016/j.molcel.2020.09.014.
Advances in Applied Microbiology
Xiaoyun Su , . Isaac K.O. Cann , in Advances in Applied Microbiology , 2012
8 Optimization of Codon Usage
Codon bias exists in filamentous fungi. An analysis from a genomic perspective revealed an average 57% GC content in the nuclear genes and confirmed an earlier finding that suggested a bias of a cytosine base at the third position in the codons of N. crassa ( Radford & Parish, 1997 ). Another analysis of 45 highly or poorly expressed genes in A. nidulans indicated that, although the GC content of the genome is close to 50%, the codon usage is highly biased to approximately 20 “optimal codons.” These “optimized codons” are characterized by their ending with C or G ( Lloyd & Sharp, 1991 ). In A. awamori and A. niger, a total of 51,434 bp of codons were analyzed. The optimal codon usage has been defined, which helped designing of a synthesized chymosin gene with A. awamori-preferred codons ( Cardoza et al., 2003 ).
The codon usage is tightly associated with recruitment of transfer RNAs (tRNAs) to the ribosome. A general principle is that rare codons (used less than 15%) in consecutive positions or in clusters will lead to inefficient translation ( Kinnaird, Burns, & Fincham, 1991 ). Optimization of codon usage has widely been used in the production of heterologous proteins in bacterial ( Haas, Park, & Seed, 1996 Kane, 1995 ), fungal ( Huang et al., 2008 Sinclair & Choy, 2002 ), plant ( Tregoning et al., 2003 ), and mammalian ( Kim, Oh, & Lee, 1997 Massaer et al., 2001 ) host cells. Overexpression of tRNAs specifically for the rare codons in the host cell (such as the BL21-CodonPlus(DE3)-RIPL from Stratagene Inc.) tend to alleviate ineffective translation, thus enhancing expression of the target gene.
To overcome the hurdle of codon bias in expressing heterologous (or even endogenous) genes in filamentous fungi, more often the codons of the heterologous genes are optimized according to the codon usage of the host cell. By changing 20 codons of a thermophilic xylanase of the bacterium D. thermophilum to those preferred in T. reesei, the xylanase was successfully expressed while the original gene failed to express ( Te’o et al., 2000 ). The native aequorin AeqA from the jellyfish Aequorea victoria was poorly expressed in N. crassa with a yield of 0.15 μg/g total protein due to the presence of 44 rare codons in the gene. After optimization of the codons, the expression rose to 2.26 μg aequorin/g total protein in N. crassa, 13.4 μg aequorin/g total protein in A. niger and 21.8 μg aequorin/g total protein in A. awamori ( Nelson et al., 2004 ). The luc gene encoding the firefly Photinus pyralis luciferase is an invaluable tool for molecular analyses of biological processes, such as circadian rhythm. However, the gene was very poorly expressed in N. crassa. Optimization of the codons for its first 21 residues resulted in successful expression of this protein and facilitated the circadian rhythm study in N. crassa ( Morgan Greene, & Bell-Pedersen, 2003 ). A more intensive optimization of its codons further increased the expression of luciferase by four orders of magnitude ( Gooch et al., 2008 ).
Effect of Codon Optimisation on the Production of Recombinant Fish Growth Hormone in Pichia pastoris
This study was established to test the hypothesis of whether the codon optimization of fish growth hormone gene (FGH) based on P. pastoris preferred codon will improve the quantity of secreted rFGH in culture supernatant that can directly be used as fish feed supplements. The optimized FGH coding sequence (oFGH) and native sequence (nFGH) of giant grouper fish (Epinephelus lanceolatus) were cloned into P. pastoris expression vector (pPICZαA) downstream of alcohol oxidase gene (AOX1) for efficient induction of extracellular rFGH by adding 1% of absolute methanol. The results showed that recombinant P. pastoris was able to produce
of nFGH in one litre of culture supernatant. The total body weight of tiger grouper fingerlings fed with oFGH increased significantly at third (
) and fourth weeks ( ) of four-week experiment period compared to those fed with nFGH. Both oFGH and nFGH significantly enhanced the final biomass and fish survival percentage. In conclusion, codon optimization of FGH fragment was useful to increase rFGH quantity in the culture supernatant of P. pastoris that can be directly used as fish feed supplements. Further studies are still required for large scale production of rFGH and practical application in aquaculture production.
Fish growth hormone (FGH), a protein hormone with molecular weight of 22 kDa, is produced by somatotroph cells in the anterior pituitary gland of fish. It regulates growth and development in fish [1, 2]. Using recombinant FGH as a supplement in the feed, fish growth rate was increased without accumulation of FGH in the fish body [3, 4]. In aquaculture applications, fish growth rate has been increased after using recombinant FGH expressed in E. coli . However, the low capacity of posttranslation process in E. coli resulted in less active recombinant proteins and formation of insoluble inclusion bodies . Insoluble inclusion bodies require more steps in protein purification process and a complicated procedure for protein refolding to recover its biological function. It has also been known that E. coli is a prokaryote and its intrinsic characteristics differ from those of eukaryotes, such as protein processing, protein folding, and posttranslational modifications . Fish growth hormone has been expressed in the yeast Saccharomyces cerevisiae [8–10] and the yeast Pichia pastoris . The yeast P. pastoris expression system offers advantages over S. cerevisiae in its high productivity, efficient secreted expression, and stable genetics, so it has been an attractive candidate for production of foreign proteins . Intracellular expression of FGH in P. pastoris that is used as feed supplement showed significant increase in growth rate on tilapia , but the expression of recombinant FGH was low (1-2% of the total cellular proteins). Interestingly, P. pastoris has a higher secretory capacity and lower expression level of endogenous proteins than other yeasts. Recombinant proteins comprise the majority of the total secreted proteins in the medium . Of note, fish growth hormone cDNA was used in most of previous studies to produce recombinant FGH in different expression systems [5, 8–11]. In our study, we constructed a synthetic FGH gene with preferred codons of P. pastoris in order to increase the expression level of recombinant FGH. Additionally, producing extracellular FGH can omit a purification process and reduce the cost of production in fish farming.
2. Materials and Methods
2.1. Culture Media
For cloning purposes, E. coli strain TOP10 was cultured in low salt LB medium (LSLB) and LSLB-agar with Zeocin (salt concentration <90 mM, pH 7.5 for Zeocin to be active). The solid medium (LSLB-agar) contains 1% peptone, 0.05% NaCl, 0.5% yeast extract, and 1.5% agar with 25 μg/mL Zeocin and the liquid medium was YPD broth, containing 2% peptone, 1% yeast extract, 2% dextrose, and 100 mg/L Zeocin. P. pastoris was cultured on YPDS-agar containing 2% agar and 18% sorbitol. For expression purposes, the buffered complex media, BMMY and BMGY, containing 2% peptone, 1% yeast extract,
biotin, 1.34% yeast nitrogen base, 0.1 M potassium phosphate buffer (pH 6.0), and 1% glycerol (for BMGY growth medium) or 1% methanol (for BMMY induction medium) were used.
2.2. DNA Preparation of FGH Fragments
Total RNA was isolated from the pituitary glands of giant grouper fish (Epinephelus lanceolatus) using RNA extraction kit (Invitogen, The Netherlands). The native FGH (nFGH) gene was reverse-transcripted and amplified using reverse transcriptase kit (Invitrogen, The Netherlands). The nFGH gene fragment was cloned into pGEM-T cloning vector (Promega, USA) and transformed into E. coli strain JM109. DNA sequencing was performed and the sequence was compared with NCBI database (http://www.ncbi.nlm.nih.gov/) for verification.
The codon-optimized FGH (oFGH) sequence was synthesised according to the P. pastoris preferred codons by Invitrogen (http://www.invitrogen.com/genesynthesis). Depending on synthesiser information, the following sequence regions were avoided or amended: (i) very high (>80%) or very low (<30%) GC content, (ii) the cis-acting sequence motifs, such as internal TATA-boxes, chi-sites, and ribosomal entry sites, and (iii) AT-rich or GC-rich sequence that stretches RNA instability motifs repeat sequences and RNA secondary structures splice donor and acceptor sites in higher eukaryotes.
2.3. Construction of Expression Vectors, pPICZαA-nFGH and pPICZαA-oFGH
Construction of recombinant PICZαA-nFGH and PICZαA-oFGH was carried out as descried previously . In brief, the pGEM-T cloning vectors containing nFGH or oFGH gene fragments were digested with EcoRI and NotI restriction enzymes and the fragments with molecular weight of about 600 bp were purified using QIAquick Gel Extraction kit (Qiagen, USA). The purified fragments were cloned separately into a EcoRI- and NotI-digested pPICZαA vector. The recombinant plasmids, pPICZαA-nFGH and pPICZαA-oFGH, were transformed into E. coli strain Top10 of for propagation purpose. These recombinant plasmids were then isolated, sequenced, and linearized with SacI restriction enzyme. Subsequently, linearized pPICZαA-nFGH and pPICZαA-oFGH were introduced into P. pastoris, wild-type X-33 strain using EasySelect Pichia Expression kit (Invitrogen, The Netherlands).
3. Expression in P. pastoris
Production of recombinant FGH in P. pastoris was carried out as described previously . In brief, a single colony of recombinant X-33 harbouring nFGH and oFGH was inoculated in BMGY medium, respectively, and grown at 30°C until OD600 was 2–6. Cell pellet was collected by centrifugation and resuspended in BMMY media (or BGMY medium for control culture) at
. Incubation was continued at 30°C with shaking at 220 rpm. Methanol was added into BMMY medium to a final concentration of 0.5% for 12-hour intervals while glycerol was added to the BMGY medium as a substitute for methanol. Culture supernatants were harvested at 24, 36, 48, 60, and 72 hours after induction and analysed by native SDS-PAGE and SDS-PAGE under denaturing and nondenaturing conditions as described previously . The recombinant protein concentrations in culture supernatant were determined using fish growth hormone, FGH ELISA kit (Cat. no. E0044f, China, http://www.eiaab.com/).
3.1. Bioactivity Test
Tiger grouper fingerlings were divided randomly into four groups, 20 per group (control 1 was fed with common fish feed control 2 was fed with common feed mixed with 5% culture supernatant from wild-type P. pastoris treatment 1 was fed with common feed mixed with 5% culture supernatant from recombinant P. pastoris producing nFGH treatment 2 was fed with common feed mixed with 5% culture supernatant from recombinant P. pastoris producing oFGH). The average body weight of fingerlings was approximately 10.5 g and the average body length was 8.5 cm. Tanks and air stones were cleaned, disinfected, and refilled with new treated fresh water. New treated fresh water was replaced every 7 days for 31 days of experiment period. Feed distribution was done three times per day: 9:00 a.m., 1:00 p.m., and 6:00 p.m. Sampling was carried out every seven days. The supernatant of recombinant yeast culture was mixed with fish feed by 5% of total feed weight.
4. Results and Discussion
Purification and administration methods of recombinant fish growth hormone (FGH) are the main concerns of impracticable use in mass-scale aquaculture. Therefore, we conducted this study to test the hypothesis whether the optimized DNA sequence of FGH will significantly enhance its expression level in P. pastoris in order to be used as fish feed supplement. The entire FGH coding sequence was constructed based on P. pastoris preferred codons (oFGH) while the native FGH (nFGH) gene was obtained by reverse transcription of total RNA that was extracted from fish pituitary gland (Figure 1). The pPICZαA plasmid contains Saccharomyces cerevisiaeα-factor secretion signal peptide for extracellular protein secretion. Both DNA fragments were cloned into P. pastoris expression vector (pPICZαA) downstream α-factor secretion signal peptide and the promoter of alcohol oxidase gene (AOX1) for efficient induction of extracellular FGH production by adding 1% of absolute methanol.
The results showed that the expression of nFGH and oFGH were detected by SDS-PAGE under denaturing conditions at the expected size of 22 kDa (Figure 2). Further analysis by SDS-PAGE under denaturing and nondenaturing conditions showed that both FGH forms (nFGH and oFGH) were produced by P. pastoris as monomers and multimers (Figure 3). After induction of recombinant yeast with methanol, the yeast was grown for 72 hrs, and then the production of FGH was quantified by FGH ELISA assay using standard FGH as a reference. The results showed that recombinant P. pastoris was able to produce mg of oFGH compared to mg of nFGH in one litre of culture supernatant (Figure 4).
In this study, P. pastoris was more efficient in producing oFGH compared to nFGH as secreted protein. It has been shown that codon optimization is important to increase translation rate through the direct use of host cell tRNA pool [16, 17] which ultimately could lead to increasing the recombinant protein quantity [18–20]. Therefore, significant increase was observed in the body weight of tiger grouper fingerlings fed with oFGH at third and fourth weeks of the experiment period compared to those fed with nFGH (Figure 5). This increase in the body weight probably was due to the higher concentration of oFGH compared to nFGH in the culture supernatant ( mg versus mg, resp.). The main outcome of the high content of oFGH in the culture supernatant is enhancing the bioavailability of FGH that essentially stimulates fish appetite as well as feed conversion rate [21, 22].
The DNA sequence used to encode a polypeptide can have dramatic effects on its expression. Lack of readily available tools has until recently inhibited meaningful experimental investigation of this phenomenon. Advances in synthetic biology and the application of modern engineering approaches now provide the tools for systematic analysis of the sequence variables affecting heterologous expression of recombinant proteins. We here discuss how these new tools are being applied and how they circumvent the constraints of previous approaches, highlighting some of the surprising and promising results emerging from the developing field of gene engineering.
► Traditional bioengineering is limited by the lack of systematic methodology. ► Studies of synthetic gene expression have questioned beliefs about gene optimization. ► Current gene synthesis technologies enable a systematic approach to bioengineering. ► Experiments with synthetic genes are refining knowledge of host preferences. ► Rigorous engineering methods will accelerate the advancement of synthetic biology.
While gene expression is usually attributed to transcription rate, the half-lives of mRNAs strongly affect overall mRNA abundances during homeostasis. Attempts at understanding mRNA stability have primarily focused on cis-regulatory elements mainly within the 3′UTR, where a large number of stability determinants (e.g., microRNAs) are known to bind. However, more recent work demonstrates that translation also strongly affects mRNA stability in a codon-dependent manner, indicating that the coding region also contains strong regulatory information [22,23,24,25]. Here, we present a computational model to predict vertebrate mRNA stability based on codon composition (Fig. 1). Our model supports the premise that codon composition is the major determinant of mRNA stability in zebrafish and Xenopus during early embryogenesis (Fig. 2). The degree to which mRNAs are impacted by microRNAs and RNA methylation (m6A) is also dependent on their respective coding sequences in zebrafish and Xenopus embryos, as well as in mouse and human cells (Fig. 4). As such, codon composition can obscure the repressive effects of other regulatory pathways (Figs. 3 and 5). Recently, several studies have aimed to identify novel cis-regulatory elements in 3′UTR regions that are active during zebrafish embryogenesis using reporter mRNAs containing a GFP coding sequence [43,44,45]. Although these methods have been successful in identifying both stabilizing and destabilizing elements in the context of a uniform coding sequence, an accounting of these elements in endogenous transcripts has been largely insufficient to explain stability profiles [43, 44]. We believe that this lack of coherence might be due to the failure to account for codon optimality effects, which impact transcript stability globally. Moreover, we hypothesize that such an accounting will enable the identification of novel regulatory programs resident in the 3′UTR or across the entire mRNA, which may otherwise be obscured by codon composition variability (Figs. 3, 4, 5, and 7).
Model showing that the mRNA stability depends on the regulatory elements of the coding and the 3′UTR, suggesting that to fully understand mRNA stability, the regulatory information across the entire mRNA sequence needs to be integrated, rather than focusing solely on the 3′UTR or in the coding sequence
Our results suggest that miR-430/-427 destabilizing activity can be affected by the coding sequence. The targeting efficacy of miR-430/-427 (zebrafish and Xenopus) is stronger in genes with average codon optimality as opposed to genes either highly enriched in optimal or non-optimal codons (Fig. 5). It can be proposed that different regulatory pathways may recruit common mRNA degradation machinery. For example, mRNAs with a very high content of non-optimal codons may already be targeted for degradation at the maximum rate. Accordingly, other pathways, such as microRNAs, cannot increase the mRNA destabilization. Therefore, we posit that it is imperative to consider the coding sequence when attempting to understand the regulatory power of both canonical and non-canonical miRNA sites [20, 46,47,48].
Our data also indicate that miR-430/-427 can antagonize codon-mediated stabilization during early embryogenesis. Specifically, we find that unstable transcripts enriched in optimal (i.e., stabilizing) codons tend to host miR-430/-427 target sites, suggesting that miR-430/-427 may have been recruited to these mRNAs to counteract the intrinsic “stability” conferred in cis by the coding sequence, perhaps to ensure robust degradation of maternal mRNAs during the MZT (Fig. 6). Of course, from an evolutionary point of view, we cannot rule out the possibility that an enrichment in optimal codons is, in fact, an evolved countermeasure in response to the repressive effects of miR-430 and/or m6A (Fig. 6). Smarca2 is a clear example of a maternal mRNA enriched in optimal codons and containing three miR-430 target sites that serve to destabilize during the MZT . Following from the developmental role and critical clearance of smarca2 during embryogenesis , it will be interesting to dissect the function of maternal mRNAs enriched in optimal codons containing miR-430 target sites. Moreover, it might also be interesting to knock-down maternal mRNAs enriched in non-optimal codons that are actually stable during the MZT . Nonetheless, these results paint a complex picture of how different regulatory mechanisms interact and co-evolve to precisely and temporally modulate mRNA stability.
Future work will aim to understand how the entire mRNA sequence affects stability and how the interplay between codon and cis-regulatory mechanisms impacts embryonic development. Studies employing mRNA reporters aimed at further characterizing cis-regulatory mechanisms (e.g., microRNAs, RNA modifications, RNA binding proteins) should take into account the degree of codon optimization resident within reporter coding sequences (e.g., “optimized” vs “deoptimized” GFP). Pertinent to this design, our lab has developed a web-interphase, iCodon (ideal codon) [50, 51], to optimize or de-optimize coding sequences based on synonymous mutations. Therefore, iCodon could be used to optimize coding sequences (e.g., Covid-19 vaccines), or to de-optimize reporter sequences such as GFP to study cis-regulatory elements in the 3′UTR (e.g., microRNA) using a coding sequence with an average optimality.
Our model predicting gene expression simply takes the codon composition (Fig. 1) and ignores the intrinsic properties of the distribution of the codons across the coding region. For example, the position of the codon can affect codon-mediated mRNA stability in both yeast and zebrafish embryos (Additional file 1: Fig. S2a) [23, 30]. Therefore, in the future, it will be interesting to uncover the grammar/rules of codon-mediated regulation, including the relative positioning (5′ vs 3′) (Additional file 1: Fig. S2c) and ordering of codons. We have recently shown that codon-mediated regulation is dependent on translation initiation rate , which in turn is influenced by sequences residing in the 5′ and 3′UTRs as well as cell conditions (e.g., viral infection) . Moreover, translation of small ORFs in the 5′UTR (uORF) as well as in the 3′UTR (dORF) can also affect translation of the main ORF [53,54,55]. Therefore, in the future, it will be interesting to further characterize the regulatory roles of 5′ and 3′UTRs in shaping translation efficiency as it relates to the codon optimality mechanism. In addition, the availability of tRNAs and charged amino acids has been associated with translation efficiency and mRNA stability in vertebrates [12, 22, 24, 31, 56]. As such, it will be important to integrate tRNA profiles into our model predicting mRNA stability. In sum, our work illuminates the complex crosstalk between cis-regulatory pathways and coding sequences in shaping mRNA stability, as well as highlights the need to develop models that integrate both components. Such models will provide valuable insights into early embryonic development, as well as identify underlying causes for gene misregulation in human disease.
Materials and Methods
Plasmids and Bacterial Strains
E. coli BL21 (DE3) and plasmid pET-30a (+) were used as the expression host and the expression vector for the expression of recombinant proteins. The designed genes (egfp-codon, mApple-codon, egfp-genscript and mApple-genscript) were synthesized by GenScript Corporation (Nanjing, China) and inserted into the plasmid pET-30a(+) with the restriction enzymes (BamHI and HindIII).
The dataset of bacterial genomes with full annotation was downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/) on December 8, 2014. All of the selected subspecies are shown in Table S1.
The 346 genomes in Table S1 were used to carry out the phylogenetic analyses. Those bacterial species had at least five sequenced subspecies except the bacterium Bacteroides fragilis NCTC 9343 which was selected as the out-group 38 . The 16 S rDNA sequences of the selected subspecies were collected from the Ribosomal Database Project 40 . All of the 16 S rDNA sequences were aligned with Clustalw 41 . The aligned 16 S rDNA sequences were exported in PHYLIP format for analysis using the PHYLIP set of programs, version 3.696 42 . Similarity matrices of the 16 S rDNA sequences were constructed with the F84 nucleotide substitution model 43, 44 using the Dnadist program in PHYLIP. Phylogenetic trees were constructed with the neighbor-joining method with the Neighbor program in PHYLIP. The reliability of the neighbor-joining tree was estimated by bootstrap analysis using 1000 replication datasets generated by the program Seqboot in PHYLIP. For the codon usage trees, 2000 genes were randomly selected from the target genome file, and the Euclidean distances of the codon usage of the subspecies were calculated to construct the corresponding similarity matrices among all of the different subspecies. The 1000 similarity matrices were created by the random selection method. Then, the program Consense in PHYLIP read all of the constructed trees and generated a consensus tree. Trees were drawn and analyzed with the Dendroscope program, version 3.0 45 .
Data Clustering and Visualization
All of the data in this study were clustered using the open-source software Cluster, version 3.0 46 , and the clustered data were visualized with TreeView, version 1.1.6r4 47 .
Construction and Evaluation of the Codon Prediction Model
The E. coli genome dataset (Table S1) contains 65 genomes, 64 of which were selected to create a codon selection index (CSI), and one of which (E. coli K12_MG1655) was used to evaluate performance of the method. All proteins encoded by the 64 genomes of E. coli were split into window sizes of three, five or seven amino acids. The same amino acid fragments from all the genomes were merged into one CSI vector, which contained the codon usage distribution of the middle amino acid and the average codon usage for each amino acid in the fragment. The genome of E. coli MG1655 was used to train and evaluate the performance of the model. Every gene in the E. coli MG1655 was split into three-, five- or seven-codon windows. Here, the size of the codon window was defined as w. Every short nucleotide sequence was also translated as a short peptide, and then all of the short peptides were searched against the CSI file with the corresponding codon window size. As we wanted to predict the codon selection of the middle amino acid in the peptide, the middle amino acid of the matched peptide and input short peptide must be the same. The matched score (s) between the input peptide and the peptide in the CSI file could be calculated with a BLOSUM62 matrix 48 . The expected maximal score (m) of the input short peptide is the sum of the corresponding diagonal scores in BLOSUM62 matrix of the amino acids in the input peptide. A cutoff (c) could be defined to select the appropriate peptide in the CSI file. Therefore, if the percent (p, p = s/m) of the matched score (s) to the expected maximal score (m) is greater than cutoff c, the matched peptide in the CSI file would be selected to update the corresponding input vector. The final input vector of the short peptide is the arithmetic mean of all the possible matched peptides, and the weight of the matched peptide is the matched score(s). Therefore, if an appropriate cutoff c and window size w were defined, every codon in the gene except the first and last w codons could be represented by one vector, and all of the vector were collected as the input dataset to predict the codon selection in E. coli. As two (methionine and tryptophan) of the 20 amino acids are coded by only one codon, 18 models for each cutoff c and window size w of the E. coli MG1655 genomes were constructed to predict the codon selection with a random forest classifier 49 . The number of trees of the key parameter of the classifier for random forests is 1000. The average overall accuracy and the AUC (Area Under the receiver operating characteristic Curve) for each amino acid model with different parameters were used to evaluate the performance of the method, which was calculated based on a ten-fold cross validation. The AUC was calculated with the package of pROC 50 .
Protein Expression and Purification
The reporter genes (egfp-codon, mApple-codon, egfp-genscript and mApple-genscript) were cloned into a pET30a(+) expression vector and overexpressed in E. coli strain BL21(DE3) pLys. Ten single colonies of the transformed E. coli carrying the reporter gene were cultured in liquid Luria-Bertani medium containing 50 μg/mL kanamycin at 30 °C overnight and then inoculated to fresh auto-induction medium (2:100 dilution) and incubated again at 30 °C with shaking at 750 rpm in incubator 1000 (Heidolph, Germany) 51 . The fluorescence intensity was measured at two-hour intervals using a SpectraMax M2 instrument (Molecular Devices, USA). The excitation and emission wavelengths were 484 and 507 nm and 568 and 592 nm, for eGFP and mApple, respectively 52 . The values shown are the averages of ten independent experiments.
Relative codon bias
The strength of relative codon bias (RCBS) was calculated based on the equation in the references 53, 54 . The protein abundance data of the E. coli was retrieved from paxdB 55 .
For non-commercial purposes, the code of the software Presyncodon can be downloaded from http://www.mobioinfor.cn/presyncodon.
All data generated or analyzed during this study are included in this published article (and its Supplementary Information files).
Here, we use reverse ecology to associate genotype (codon optimization) with phenotype (growth rate on galactose) and ecology (isolation environment) across an entire evolutionary lineage (budding yeasts). The conceptual evolutionary model for this association (Fig 5) is that selection for increased rates of galactose metabolism in galactose-rich environments will result in selection for optimization of codon usage in the GAL genes. This selection is likely to continue until codon usage is no longer a barrier to maximum flux allowed through this pathway for a given metabolic load. Therefore, codon optimization not only reflects a mechanistic measure of expression but an evolutionary signal for selection on increased expression.
The ancestral species in environment A maintains the GALactose metabolism pathway at an intermediate codon optimization. Upon introduction to environment B, which contains abundant galactose, there is increased demand for the GALactose metabolism enzymes to take advantage of this energy source. In this new environment, substitutions that increase codon optimization of the GAL genes will be selectively advantageous. Codon optimization will continue to increase under translational selection until it is no longer a barrier to expression or optimal flux through the pathway has been achieved.
By studying a well-known metabolic pathway in a diverse microbial subphylum, we provide a proof of concept for the utility of codon optimization as a genomic feature for reverse ecology. Our discovery of optimization in the GAL pathway in dairy-associated Saccharomycetaceae and human-associated CUG-Ser1 yeasts is consistent with the known functional roles of the enzymes in the pathway. The complete GAL pathway metabolizes galactose, a component of dairy environments, into usable energy . The GAL10 gene affects phenotypes associated with human colonization in CUG-Ser1 yeasts . Similarly, in the Kluyveromyces species found on dairy-associated niches that are able to metabolize lactose into glucose and galactose, there is high optimization in this pathway compared to closely related species not associated with dairy. Interestingly, examination of codon optimization in the gene sets of the 4 Kluyveromyces species studied here would have identified at least K. marxianus as a potential dairy-associated yeast, even in the absence of any knowledge about its isolation environments. Thus, genome-wide examination of codon optimization in fungal, and more generally microbial, species can generate specific hypotheses about metabolic ecology that can be experimentally tested. Our method can also be applied directly to single-cell genomes generated from microbial dark matter known only from DNA . Finally, using an unbiased approach, we identified a strong correlation between optimization in the GAL pathway and other pathways involved in metabolic processing. This novel finding suggests that codon optimization may also be useful for identifying coregulated or correlated pathways in microbial, including fungal, species.
By focusing on a well-characterized pathway, we are able to associate a specific genotype with both a phenotype and ecology. While previous codon-based reverse ecology studies have identified functional categories of genes associated with environments [39–43], we illustrate that this approach can also be useful at the level of individual genes and pathways. It is important to note that this approach may not work for all genes—especially those that have universally high codon usage, such as ribosomal genes and the mannose metabolism gene PMI40. More broadly, our results suggest that codon optimization can be a useful tool for predicting candidate genes and pathways involved in ecological adaptation, which can be subsequently tested experimentally.
As shown, a proximal and strong bottleneck is correlated with an increase in protein abundance. A proximal bottleneck can reduce the number of jammed ribosomes on a transcript. Therefore, it can reduce both the number of occupied ribosomes and the number of delayed ribosomes. Delaying ribosomes on the mRNA might increase their abortion rate, thus causing early termination of the translation , reducing protein levels. For ribosomes to jam, a fast initiation rate is required. This is usually the case in highly expressed genes, in cases of heterologous gene expression, and in synthetic libraries such as discussed here where high protein levels are desired. Due to amino acid sequence constraints for some genes, a naïve approach, using only optimal codons, might result in an unintentional distal bottleneck.
While the bottleneck parameters are correlated with protein abundance, they are not correlated with fitness. This suggests that while the occupation of more ribosomes sequesters them from the cell's pool, for most genes in the GFP library it does not cause a shortage of ribosomes, enabling the cell to continue translating other transcripts. The decrease in fitness is correlated with the increased usage of codons UCA and CAU, suggesting a shortage of the complementary tRNAs.
Our results thus show that, along with mRNA stability, codon choice does affect translation efficiency, and that naïve averaged measures such as CAI and tAI do not capture this regulatory capacity. The results also show that while codon choices do affect both translation efficiency and cell fitness, different aspects of codon selection affect differently the production capacity and costs. One direct conclusion from our results relates to the popular usage of 'His-tags', chains of histidine residues at carboxyl termini of genes in heterologous expression systems . When using carboxy-terminal His-tags in bacterial expression systems it would be advantageous to encode histidine with CAC rather than with CAU for two reasons: first, because CAU appears to correlate negatively with fitness and second, in order to avoid a bottleneck towards the end of the gene.
When trying to understand the cell system, one realizes its processes are regulated on many different levels. As shown in this paper, synthetic gene libraries enabled us to control for a significant portion of gene variability and focus on the effects of regions with less than optimal codons (the bottleneck). Identification of bottleneck effects in synthetic genes thus completes Tuller et al.'s  bioinformatics work that identified clustering of low efficient codons at the beginning of ORFs of natural genes. The results further demonstrate how correlative conclusions made from observations of natural gene sequences can be complemented by synthetic genes, allowing decoding of the sequence features governing the efficiency of translation and it costs.
It is our belief that through carefully designed synthetic libraries many other regulation processes can be understood, thus completing the first step towards understanding the regulation process as a whole.
CODON OPTIMIZATION: ENGINEERING A MORE USEFUL GENE AT THE CODON LEVEL
Many important bioproducts are comprised of proteins, and proteins are made up of amino acids, specified by certain codons in an organism's DNA. This DNA is then transcribed and translated to create these proteins. Since multiple codons can be used to specify an amino acid in E.coli, it is possible to use multiple coding sequences to produce the same chain of amino acids.
The Big Question
Even though genes with different codon preferences can code for the same protein, they may not necessarily do so at the same rate, or lead to expression of that protein at the same level 1 . The question we asked is "How and why are they different?"
To biologists, the answer to this question may shed light on the nature of translation itself, and the reasons that some codons are naturally preferred by the genome of E.coli. To engineers, it could act as an additional point of control over tricky genetic systems, and perhaps more efficient production of useful bioproducts. To undergrads, the pursuit of these answers was the summer of a lifetime.
Degenerate codons do not necessarily lead to equally efficient expression of an amino acid.
Codon optimization is not a new idea, but what makes our project special is that it used several novel criteria for optimization. By picking specific codons for a synthetic gene, we can determine the effect of certain types of codons on translation and ultimately draw conclusions that may be very useful to the biotech community.
To optimize genes we had to find some criteria. What we chose were codons that were generally rare in the overall genome (rare-G1), common in the overall genome (common-G2), abundant in regions of the genome with known fast translation initiation (fast-G3), abundant in regions of the genome with known slow translation initiation (slow-G4), or predicted by advanced software to have slow insertion time (slow insertion time-G5).
Next, we chose a gene to optimize, and came up with superfolder GFP, a well studied reporter gene that was described by the 2008 Cambridge iGEM team (part BBa_I746916). We optimized the gene using our five criteria, then assembled plasmids containing all the necessary parts for the gene to be expressed. The "fast" optimized GFP has been submitted as a part to the registry, as it is the only variant that we have finished characterizing (part BBa_K1506002).
In order to find out more about the translational efficiency of our GFPs, we designed a degenerate ribosome binding site (dRBS) to be inserted in our construct before all five variants. By measuring the expression plateaus at high translation initiation rates (TIR), we will be able to see how efficiently our GFPs are translated. Raising or elimenating plateaus at high TIR will be a marker of how effectively we optimized the gene.
Our final design looked like this:
Includes a promoter, RBS, leader sequence, variant gene, and terminator
Unfortunately, we were unable to progress beyond the cloning phase with G4, and beyond the insertion of dRBS with G1 and G5. Of the two variants that we characterized, only the results for G2 have been sequence verified and analyzed at this point.
Shows the adjusted average fluorescence of the cells that we measured graphed with the strength of the ribosome binding site in that strain.
The results show only vague indication that expression of GFP was increased as TIR increased. Unfortunately, only 14 strains were successfully characterized and sequenced. Characterization of the other GFPs will allow us to determine the success of our optimization.
First, the cloning will be completed. Once all five variants are present in the backbone with the degenerate ribosome binding site inserted, fluorescence data will be collected. Once each measured colony is sequenced, the fluorescence will be graphed alongside the predicted strength of the ribosome binding site. This will give us a good idea of the translational efficiency of each GFP. By comparing fluorescence data for a single ribosome binding site across multiple GFPs, we will be able to determine the effects of our optimization criteria.
1. Subramaniam, Arvind R, Tao Pan, and Philippe Cluzel. “Environmental Perturbations Lift the Degeneracy of the Genetic Code to Regulate Protein Levels in Bacteria.” Proceedings of the National Academy of Sciences of the United States of America 110.6 (2013): 2419–24. Web. 26 May 2014.
Complete Project Information- Beyond the Summary
Codons are groups of three nucleotides that specify a single amino acid, which is then added to a growing polypeptide chain during translation. Even though each codon spefifies only one amino acid, some amino acids are coded by multiple codons. It has been demonstrated that the genome of E.coli shows statistical preference for some of these degenerate codons over others, and it is hypothesized that these codons translate more efficiently than non preferred degenerate codons. We constructed synthetic reporter genes entirely from codons hypothesized to be fast or slow,and characterized them in E.coli. As of right now, we have characterized the level of GFP expression in G3, the fast coding sequence for superfolder GFP. GFP 2, the common coding sequence has been successfully transformed in the plasmid and data is expected to be analyzed for that variant by the giant jamboree. There are still issues with getting a dRBS in the G1, rare, and G5, the calculated slow insertion time. G4, the slow codons, has not been successfully introduced to the backbone, so it has been decided to forego this variant due to time constraints.
Illustrates the principle of codon redundancy. The number in each degenerate codon refers to the criterion that would lead to it being chosen.
Why is this important?
Examples of bioproducts vital to our lives are medicines, fuels, and even industrial chemicals. Codon optimization is important because it gives engineers an additional point of control over protein synthesis.
Our codon optimization research is important for the additional reason that it will help future researchers to develop more comprehensive models of translation. A better understanding of translation is an example of a foundational advance in biology that will lead to faster, more efficient research in many areas of biology. If, for example, our research shows clearly that certain degenerate codons are preferred because they can be translated more efficiently this will allow scientists to search for a mechanism that predicts these effects, and will invite engineers to redesign genes to be translated more efficiently.
Metaphor Alert: Codon optimization can be thought of as an extra dial that can be tuned to rationally control output of genetic systems.
Codon optimization refers to the idea that the individual codons of a gene in a specific organism can be changed in order to alter the behavior of that organism. This relies on an understanding of the central dogma of biology, which states that any organism produces proteins by first transcribing genetic material in the form of DNA to RNA, which is then “read” by ribosomes which produce proteins based on the sequence of amino acids in that RNA. The reading of the RNA is done three nucleotides at a time, and these three letter series of nucleotides are called codons. Codons specify to the ribosome which amino acid to add to a growing amino acid chain.
There are 4 nucleotides, thus 43, or 64 codons are possible. Since there are only 20 amino acids, there is redundancy in the codons, that is, some amino acids are specified by multiple codons. There is no ambiguity, however, meaning that each codon specifies only one amino acid. Codons that code for the same amino acid are called degenerate codons, and even though these degenerate codons code for the same amino acid, they do not necessarily lead to the same expression levels of that amino acid.
Degenerate codons do not necessarily lead to equally efficient expression of an amino acid.
1) Find Criteria for Optimizing Genes in E. coli
All coding sequences were designed so that there would be no difference between the amino acid profile of the variant GFP and the original superfolder GFP. This ensured that each gene led to the expression of the same protein.
Previous researchers have determined through a statistical analysis of the entire genome that some degenerate codons occur more often in protein coding sequences and some are more infrequent. These are referred to as common and rare codons. The importance of this is that protein expression in cells is limited either by either translation initiation rate (TIR) or translation elongation rate, and it is theorized that commonly occurring codons will have faster elongation rates than degenerate rare codons. Translation initiation rate can be artificially controlled by varying the strength of the ribosome binding site (RBS), which consists of the genetic sequence that precedes the protein coding sequences (CDS) of a gene. This is accomplished through the use of the RBS calculator, and in previous research was used to steadily increase the RBS strength of a gene, GFP mut3b, the expression of which was then characterized. Unexpectedly, expression level of proteins plateaued even as the RBS strength (and thus TIR) was increased. By using the RBS Calculator to increase the translation initiation rate, we can detect when the plateau occurs, which is called the "maximum translation rate capacity." Since this plateau occurs independently of TIR, it is theorized that it is due solely to translation elongation becoming a rate limiting step. The design of the GFPs using only common and rare codons was based on the data in this table.
Number in % collumn shows the percent of time that the amino acid was coded for with a specific codon. More frequent codons have higher percentages.
Modified from Maloy, S., V. Stewart, and R. Taylor. 1996. Genetic analysis of pathogenic bacteria. Cold Spring Harbor Laboratory Press, NY.
To optimize for common degenerate codons, the most frequent codon for a specific amino acid was taken. To optimize for rare degenerate codons, the least frequent codon was taken. For example, if a codon in the original superfolder GFP coded for Phenylalanine, the codons UUU and UUC were available. The frequencies of these were taken from the table (UUU-.51, UUC-.59). For common GFP, UUU was used whenever Phenylalanine was desired, because it had the highest frequency. For rare GFP, UUC was used. Because all codons were found to be either common or rare for E. coli, the common and rare optimized genes had zero commonality. Tryptophan is the only amino acid coded by one codon, but it does not appear in superfolder GFP.
Result of common/rare optimization is two coding sequences with zero commonality.
In another recent project, all the genes (coding DNA sequences) of E. coli are divided into five groups based on the naturally occurring TIR, from lowest to highest. Then, the codon usage profile of each group of genes is statistically analyzed to determine whether a codon is slow or fast. A fast codon is defined as one with high correlation between TIR and its frequency. Otherwise, it is a slow codon. It is hypothesized that the groups of CDS with high TIR will hold more “fast” codons, which will lead to higher translation elongation rate and thus higher protein expression, whereas the slow regions will hold more “slow” codons leading to lower expression. This data is summarized in the following figure.
Codon Frequency in Fast and Slow regions of the Genome
Fast codons show a positive correlation between frequency and TIR, slow codons show a negative correlation
Ng, C. Y., Farasat, I., Zomorrodi, A. R., Maranas, C. D. & Salis, H. M. Model-guided construction and optimization of synthetic metabolism for chemical product synthesis. Synthetic Biology Engineering Research Center Spring Retreat (2013), Berkeley, CA.
Codons whose frequency increases with TIR are defined as fast codons. Those with declining frequency in relation to TIR are slow codons, and those with no correlation are defined as independent of TIR. This can be viewed in the figure above as the slope of the graphs for each codon showing ratio and TIR. If ratio increases with TIR, the codon is fast, and the graph displays positive slope. Slow codons slow negative slope, and TIR independent codons show essentially no slope.
Example of a fast codon. Notice that the codon is more frequent in higher TIR regions of the genome.
In another related research project, researchers developed a program which models the process of translation elongation. This program takes into account the chemical binding of individual codons to ribosomes as well as numerous other relevant biological criteria and is able to predict the time taken by a ribosome to add an amino acid to a growing polypeptide chain. This is known as the “insertion time” for that codon. Using this software, a list of the insertion times for each codon was compiled. It is theorized that codons with longer insertion times will have lower translation elongation rates and thus lower the expression of the protein of the particular CDS that contains the slow codons. This data is summarized in the table below.
Codons with faster insertion times are theorized to lead to higher protein expression.
2) Apply These Criteria to a Reporter Gene (GFP)
It is important to understand the difference between the slow, fast, rare, common, and insertion times criteria. Common and rare codons are based on the frequency of particular degenerate codons in the entire E. coli genome. The hypothesis that common codons will lead to higher expression of proteins is based on the idea that cells have become optimized through evolution to efficiently translate proteins necessary to their survival. Based on this assumption, the most efficient codons will appear more frequently in the overall genome. The fast and slow codon differentiation relies on a very similar analysis. Fast codons are defined as those with a high correlation between frequency and the high TIR regions of the genome, whereas slow codons are those with a high correlation between frequency and the low TIR regions of the genome. This is an extension of the common/rare distinction, but is more specific, as certain parts of the genome with low TIR could possibly code for proteins where high expression (and thus fast translation elongation) is not necessary. By analyzing the codon usage profile of individual regions of the genome with different TIR, it is theorized that the fast and slow codons could be used to artificially control the expression of a particular gene through codon optimization. In some instances, the same codon is used for multiple optimization plans (fast, common, ect) to specify a certain amino acid. Because of this, some genes have instances of similarity, where the same codon is used in the same position.
Result of common/fast optimization is two coding sequences that may be similar, as in some instances a codon may be both common and fast.
The slow insertion time design was based solely on recently designed software which analyzes the biophysical phenomena that underlie translation elongation. Thus, the difference between the slow insertion time GFP and the others is that it is optimized based on the results of biophysical modeling instead of codon usage profile, and thus, on understanding of the physics of translation elongation verses the understanding that evolution optimizes organisms for high efficiency. The overall hypothesis of this research can now be fully understood.
This hypothesis is that the maximum translation rate capacity is due to translation elongation becoming the rate limiting step of protein synthesis, and that it could be controlled by increasing the translation elongation rate, through codon optimization of the CDS. Essentially, the presence of more common or fast codons is hypothesized to raise the maximum translation rate capacity, while the presence of more slow, rare, or slow insertion time codons will lower it.
To test this hypothesis, five variants of the Green Fluorescent Protein gene (GFP), one each of purely fast, slow, rare, and common codons, as well as one with codons having the slowest insertion time, were designed and constructed. The TIR of each variant GFP was then varied by attaching ribosome binding sites (RBS) of varying strength to the synthetic genes. The genes were then expressed in E. coli cells.
Shows the similarity between GFPs. Genes with opposite criteria, such as slow/fast and common/rare, show very little similarity.
To design the genes a custom program was created which replaces all degenerate codons in a gene with the desired codons, for example, replacing all rare codons with common degenerate codons or all slow codons with fast degenerate codons. The variants were sent for construction at a commercial laboratory (Integrated DNA Technologies), and then inserted into viral DNA vectors which were incorporated into the cells’ plasmids through existing replication machinery. This was accomplished through basic cloning, and the expression of the fluorescent protein was then characterized using flow cytometry, a quantitative method of measuring fluorescence. Using this data, the maximum translation rate capacity of each variant GFP was determined and that data was used to distinguish between rare, slow, frequent, and fast codons.
3) Introduce the Synthetic Genes Optimized Using Our Criteria into E. coli
Our general plan for expressing our variant GFPs in living cells was to ligate the genes into a vector, transform the cells with the vector, then sequence to confirm the presence of our variant genes. After this we ligated in a dRBS, measured the florescence of the cells, and then sequenced again to determine which colonies were using which RBS.
The figures below show the construction process that we used to insert an RBS library and variant GFP into plasmid pFTV.
Through inverse PCR, we cut away the existing superfolder GFP while amplifying the rest of the plasmid. By adding "tails" to our primers outside the annealing sites we were able to introduce new restriction sites into the plasmid.
Inverse PCR Products
Inverse PCR cuts out the pre-existing coding sequencing while simultaneously amplifying the plasmid backbone with the new restriction sites.
Each GFP variant is inserted separately
A decision was reached to use a leader sequence to homogenize the first 60 base pairs of each GFP.
Why use a leader Sequence?
Leader Sequence is heavily optimized to ensure it does not become the rate limiting step in translation. It ensures an even range of translation initiation is sampled across all variants by homogenizing the first 60 base pairs, which can impact TIR.
Our construct before insertion of the dRBS
The dRBS will be inserted between Sac1 and Pst1
pFTV carrying a variant GFP. The dRBS will be inserted between sites Sac1 and Pst1.
The spacer which had held a place for the dRBS is cut out by the enzymes Sac1 and Pst1
dRBS is flanked by restriction sites Sac1 and Pst1 and is manufactured by annealing two compliementary oligos that contain the dRBS.
The Ribosome Binding Site
The degenerate ribosome binding site (dRBS) is a sequence that contains a ribosome binding site library. Using software developed by the Salis Lab, we calculated the range of translation initiation (TIR) that would be expected for this sequence, from 0.5-157,000 au.
The dRBS sequence is the location where the ribosome binds, and a higher TIR will allow more ribosomes to bind to the mRNA. It was essential to measure the performance of our synthetic GFPs over a wide range of TIR in order to see if expression plateaued at high TIR, indicating that translation elongation was the rate limiting step, or whether it climbed along with TIR, indicating that elongation had been made sufficiently more efficient as to avoid any plateau.
The sequence carries five degenerate letters. Four of these specify one of two possible bases, while the other specifies one of three. Because of this, there are 2*2*2*2*3 = 48 possible sequences in our dRBS.
Graph showing the sequences in our library and their calculated TIR
Each number on the x axis corresponds to one of the sequences out of the total 48. TIR is graphed on the y axis.
Salis, Howard, Voight, Christopher, and Mirsky, Ethan. “Automated design of synthetic ribosome binding sites to control protein expression.” Nature Biotechnology 27 (2009): 946 – 950. Web.
The final construct
Includes a promoter, RBS, leader sequence, variant gene, and terminator
4) Characterize the GFPs by Measuring Fluorescence of the Cells
We were able to characterize the fluorescence of the fast codon variant of superfolder GFP.
This graph shows the absorbance of our colonies versus time.
It is clear that several strains exceed the rest.
The strongest strains did not bear the highest TIR ribosome binding sites.
Unexpectedly, the most prolific strains did not contain the strongest ribosome binding sites. It is possible that at very high TIR the cells experienced toxicity or too much metabolic strain from the production of GFP. With further characterization of the GFPs it will be possible to compare the expression of strains containing the same RBS but different variant GFPs. Through this analysis we will be able to determine the strength of our optimization criteria and hopefully add to the ability of future engineers to optimize their genetic systems.
Material and methods
Human embryonic stem cell culture and differentiation
The human embryonic stem cell (hESC) line Hues9 (H9) was obtained from the Wicell Research Institute (Madison, WI). hESCs were maintained in Essential 8 media (Thermo Fisher Scientific) on hESC-qualified Matrigel (Corning)-coated plates at 37 °C with 5% CO2. Cultures were dissociated in clumps using 0.5 mM EDTA in PBS every 4 days and media was renewed daily. Differentiating hESCs were cultured in MEF-conditioned KSR media plus 1 μM retinoic acid in DMSO for 5 days, while the control hESC population was maintained either in MEF-conditioned KSR media completed with 4 ng/ml FGF2 or Essential 8 media (Thermo Fisher Scientific). During the course of the experiment, the media were replaced daily. KSR media consists of 85% KO-DMEM (Thermo Fisher Scientific), 15% KO-serum replacement (Life Technologies), 1 mM Glutamax (Thermo Fisher Scientific), 0.1 mM 2-mercaptoethanol (Thermo Fisher Scientific), and 0.1 mM non-essential amino acids (Thermo Fisher Scientific). In the embryoid body experiments, 70% confluent hESCs were dissociated in clumps using 0.5 mM EDTA in PBS and seeded in ultra-low attachment well plates (Corning) maintaining a 1:1 dilution factor. hESCs were cultured in Essential 6 media (Thermo Fisher Scientific) plus 10 μM rock inhibitor (Y-27632) (Stem Cell Technologies Canada) for the first 24 h and for further 4 to 6 days in Essential 6 media (Thermo Fisher Scientific), when samples were collected for qPCR or Western blot experiments.
RNA, qPCR, and Western blot
Total RNA was extracted using TRIZOL (Thermo Fisher Scientific) according to the manufacturer instructions. Reverse transcription was performed using SuperScript III Reverse Transcriptase (Thermo Fisher Scientific) and random primers (Promega). Quantitative PCR were run using TaqMan probes (Thermo Fisher Scientific) for eukaryotic 18S rRNA (X03205.1), ADAT2 (Hs00699339_m1) CDX2 (Hs01078080_m1), DLX3 (Hs00270938_m1), DLX5 (Hs01573641_mH), DNMT3B (Hs00171876_m1), FOXD3 (Hs00255287_s1), GATA6 (Hs00232018_m1), HOXA1 (Hs00939046_m1), NANOG (Hs02387400_g1), POU5F1 (Hs03005111_g1), PRDM14 (Hs01119056_m1), and TDGF1 (Hs02339497_g1).
Protein extracts used for western blotting were prepared in RIPA buffer (50 mM sodium chloride, 1.0% NP-40, 0.5% sodium deoxycholate, 0.1% SDS, 50 mM Tris, pH 8.0). The antibodies used in the experiments were anti-HSP90 (sc-13119) 1:5000 and anti-OCT3/4 (sc-5279) 1:1000 both from Santa Cruz Biotechnologies and anti-ADAT2 (ab135429) 1:1000 and anti-ADAT3 (ab125514) 1:1000, both from Abcam. The Western blot band intensity was measured using ImageJ.
For immunofluorescence experiments, hESCs were plated on Matrigel-coated coverslips and cultured as described. At the desired time point, cells were washed in PBS and fixed for 10 min in ice-cold 4% PFA. Cell permeabilization was performed in 0.01% Triton x-100 in PBS for 5 min at room temperature. Coverslips were blocked for 1 h at room temperature using 10% donkey serum in PBS. Anti-ADAT2 (ab135429) 1:100 and anti-ADAT3 (ab125514) 1:100 primary antibodies were incubated overnight in the same blocking solution. Alexa Fluor secondary antibodies (Thermo Fisher Scientific) were used at a dilution of 1:500 in blocking solution. Nuclei were counterstained using DAPI. Images were acquired on a Leica SP8 Confocal microscope and processed.
Preparation of RNA-seq, Ribo-seq, and tRNA seq libraries
Total RNA from self-renewing and differentiated hESCs (four biological replicates per condition) was extracted using Trizol according to the manufacturer’s instruction. Total RNA was treated with the RiboZero Magnetic Kit (Epicentre, MRZH11124) to remove ribosomal RNA. Libraries for sequencing were prepared using the NEB ultra-directional library prep kit (NEB).
Ribosome-protected RNA was isolated as described before [61, 74]. Briefly, at the indicated time points, H9 cells (4 replicates) were washed with PBS and lysed in 20 mM Tris-Cl (pH 7.4), 150 mM NaCl, 5 mM MgCl2, 1 mM dithiothreitol (DTT) (Sigma), 1% Triton X-100 (Sigma), 25 U ml − 1 of Turbo DNase I (Thermo Fisher Scientific), and 100 μg ml −1 of cycloheximide (Sigma). After passing the lysates ten times through a 26-G needle, they were spun down at 13,000 rpm for 10 min. Digestion with RNaseI (100 U μl −1 , Thermo Fisher Scientific) for 45 min at room temperature was used to produce ribosome mRNA footprints. The RNaseI digestion was inhibited with SuperaseIN (Thermo Fisher Scientific), and lysates were fractionated on a 1 M sucrose cushion by ultracentrifugation at 45,000 rpm in a 70Ti rotor for 9 h at 4 °C. The ribosome mRNA footprints were further purified using Qiazol reagent. Footprints with a length of 26–34 nucleotides were size selected on 15% TBE-urea gel (Thermo Fisher Scientific) and 3′-dephosphorylated with T4 polynucleotide kinase (10 U, NEB). All samples were multiplexed and sequenced on the HiSeq2500 platform (Illumina). The NGS data are up-loaded onto GEO (GSE123611): Bornelöv S, Selmi T, Flad S, Dietmann S, Frye M. Codon usage optimization in pluripotent embryonic stem cells. Datasets. Gene Expression Omnibus.https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123611(2019).
For tRNA sequencing, differentiating, and self-renewing, H9 (four replicates per condition) were collected in TRIZOL (Thermo Fisher Scientific) at day 5 of differentiation. Fifty microgrammes of total RNA samples was DNaseI digested (Ambion) at 37 °C for 30 min. The treated samples were then subjected to phenol-chloroform extraction and isopropanol precipitation and resuspended in 1 mM EDTA 0.1 M Tris-HCl (pH 9.0). One microgramme of each replicate was used to generate cDNAs and perform qPCR to confirm differentiation. The remaining sample was heated at 37 °C for 30 min in order to de-aminoacylate tRNAs. Subsequently, samples were heated for 5 min at 80 °C and then loaded on a Novex™ TBE-Urea polyacrylamide gel (Thermo Fisher Scientific). A gel fragment spanning from 50 to 100 nucleotides, according to the Abnova RNA marker low easy, was excised from the gel and resuspended in 400 μl of a solution of 300 mM NaAc pH 5.5, 1.0 mM EDTA, and 0.25% SDS. RNA was extracted by a cycle of 30-min freezing at − 80 °C followed by overnight gentle mixing at room temperature in the same buffer, and subsequently was ethanol precipitated. RNA ends of any tRNA fragments were phosphorylated by treatment with 20 U of T4 Polynucleotide Kinase (NEB) supplemented with ATP (NEB). Following heat inactivation and phenol chloroform extraction, and isopropanol precipitation, samples were subjected to library preparation using the Illumina Truseq small RNA kit (Illumina). Samples were multiplexed and sequenced on the HiSeq4000 platform (Illumina).
Processing of RNA-seq, Ribo-seq data, and tRNA-seq data
Human RNA-seq and Ribo-seq data were obtained as described above. Mouse RNA-seq and Ribo-seq data with accession numbers SRR315620-SRR315627 and SRR31591-SRR31600 were downloaded from the NCBI short read archive . Trim_galore! was used to trim small RNA adapters from the human data (auto-detected under default settings) or polyadenylation sequences from the mouse data (using the parameters “--stringency 1” and “--adapter” with ten “A” followed by multiple “N”). Trimmed reads shorter than 20 nt were discarded. Alignment was done using Tophat2 (v2.1.0) . The RNA-seq reads were aligned directly to the reference genome (hg38 or mm10). Ribo-seq reads were firstly aligned to a set of known rRNA and tRNAs (selected from the UCSC RepeatMasker tracks), followed by alignment of all unmapped reads to the reference genome. An index with known transcripts (Gencode v23 or Gencode vM9) were provided for the genome alignment, and novel splice junctions were permitted [76, 77]. Multi-mapping read alignments were not used.
For tRNA-seq data analyses, Trim galore! with “--paired --stringency 6 -a TGGAATTCTCGG -a2 GATCGTCGGACT” was used to trim small RNA adapters and to remove reads shorter than 20 nucleotides. Alignment was done using bowtie  with “-n 2 -y -k 1 --nomaqround --allow-contain” to allow for up to two mismatches in the alignment and to report one alignment per read. The reads were aligned to the 430 tRNAs from the high-confidence set in GtRNAdb . Introns were removed, and the “CCA” tail was added before creating the bowtie index.
Modified bases were identified using samtools mpileup with “-BQ 0 -t AD -vuf” to report base coverage per positions. Inosine modifications were identified as positions with an A-to-G substitution at position 33–35. The presence of the expected anticodon in the reference sequence was used to verify that only substitutions at the correct nucleotide were counted. We identified 21 Ala, 7 Arg, 8 Ile, 9 Leu, 9 Pro, 8 Ser, 7 Thr, and 9 Val tRNAs with at least one mapped read supporting the substitution. These are the same eight tRNA isotypes that were expected to be modified. Next, we estimated the overall modification level per sample as the percentage of reads supporting the modification (A-to-G substitution) across all 78 modified tRNAs.
DeepTools were used to quantify the Ribo-seq coverage across the whole reference genome . Each strand was quantified separately, and a blacklist file containing all rRNA, tRNA, snoRNA, snRNA, and miRNA was provided. The bin size was set to 1 and an offset of 12 was used to only consider a single nucleotide corresponding to the “P” site predicted from each read. A meta-gene profile for all protein coding genes was then computed using computeMatrix with the parameters “scale-regions -b 500 --unscaled5prime 200 --regionBodyLength 1000 --unscaled3prime 200 -a 500” to define relative coordinates and “--metagene --exonID CDS --transcriptID gene --transcript_id_designator gene_id” to define the coding sequence of each gene. Finally, a trimmed mean was used to exclude the most extreme value at each position.
Differential mRNA levels and translation
Gene models were downloaded from Gencode (v23 for human and vM9 for mouse). FeatureCounts was used to quantify the number of reads per gene, represented by either all annotated exons (RNA-seq) or all annotated coding sequences (Ribo-seq) . Only reads aligning to the sense strand of the gene and with mapping quality of at least 20 was counted. For the paired-end RNA-seq, the additional flags “-B -C” were specified to exclude chimeric reads and/or reads mapping with only one end. Differential mRNA expression and translation analyses were done with DESeq2 . The heatmaps were created using the R package pheatmap (https://CRAN.R-project.org/package=pheatmap), and the amino acid logo was created using the DiffLogo package .
Determining read periodicity and the position of the ribosome P-site
Ribosome profiling data display 3-nt periodicity, but the reading frame of the 5′ read ends differ between different experiments and even between read lengths within the same experiment. The human data showed the strongest periodicity for reads of length 27–29 (Additional file 1: Figure S1E-F), the 5′ end of those reads were highly enriched for the first codon position, and the most frequent starting position was located 12 positions upstream of the TIS. We therefore decided to base our codon analysis in humans on these three read lengths and to extract read position 12–14 as the P-site codon.
The mouse data showed different dominating reading frames for different read lengths (Additional file 1: Figure S4C). To predict the strongest reading frame for each read length, we again calculated the number of reads starting from each position around the TIS (see Additional file 1: Figure S4D for an example). Next, we counted the most frequent reading frame both per codon (from codon position − 7 to + 32 relative to the TIS) and across the whole region. We excluded read lengths where the reading frame that was most abundant in the highest number of codons was different from the one that was most abundant in total. Furthermore, we required the most abundant reading frame was at least five-fold more abundant than the second most abundant both in terms of the number of codons and in total. This resulted in read lengths 26, 29, 31–36, and 39 being selected for the mouse data. Out of these, read length 39 was later excluded due to having substantially fewer reads and showing poor correlation to the others. Since the highest number of reads started 13 or 14 positions upstream of the TIS (depending if frame 3 or 2 was most abundant), we trimmed one or two nucleotides of the reads in the subsequent analysis so that the P-site would still be located at read position 12–14 similar to the human data.
Extracting codon counts per position relative to the A-site
The bam file for each sample with uniquely aligned reads was converted to bed format. Bedtools intersect was used to select reads with at least 50% overlap to Gencode-annotated coding sequences. Next, the reading frame of the 5′ end of each read was determined using the frame information in the Gencode annotation. If the frame did not agree with the expected reading frame for that read length, the read was discarded. For the mouse data, reads starting with the third or second position of a codon had one or two nucleotides trimmed off the read to put all reads into the same reading frame. Then, nucleotide positions 1–27 were extracted from each read as nine codons, numbered as codon position − 5 to + 3, where 0 corresponded to the A-site. Next, we counted the number of occurrences of each codon per position, and we validated the approach by verifying that codon counts calculated using different read lengths but corresponding to the same codon position (e.g. the P-site) correlated better to each other than to any other codon counts. Furthermore, counts from position − 4 to − 2 and + 1 to + 3 correlated well to the genome-wide distribution of codons in the human and mouse translatomes, whereas counts from the predicted P-site and A-site did not. We therefore concluded that we were counting codons from the correct frame for each included read length for both human and mouse.
The number of codon occurrences were counted separately for each ribosome-protected codon position and converted into a fraction of the total number. Normalized codon counts were obtained by dividing the codon fraction at a specific position by the mean fraction across all nine positions. Significant differences in normalized codon enrichment were calculated using Student’s t test using the four biological replicates per cell state. The p values were corrected for multiple testing using FDR correction. The figures are showing the signal across pooled replicates.
GO term analysis
Coding sequences were downloaded from the consensus CDS project. GO relations (go-basic.obo) were downloaded from the Gene Ontology Project (http://www.geneontology.org) and parsed using the Perl module GO::Parser. Relative codon frequencies were calculated per gene and then averaged per GO term. Only GO terms with at least 40 genes were considered. The PCA analysis was performed in R using the method princomp using the correlation matrix.