# How many proteins are in the Earth's proteome?

Humans alone have thousands of proteins. With that in mind, it seems like the total number of proteins among all species would be very large.

Are there any available estimates for how many proteins exist on Earth across all organisms? I'd also be interested in how many of these are unique proteins as opposed to proteins that are very similar to other proteins, i.e., an estimate of non-redundant proteins alongside the redundant proteins.

# Current records

According to UniProt, there are 85,381,808 protein records, and with the UniRef90 filter (i.e., removing records that can be represented by an entry with at least 90% sequence identity), there are 42,424,511. However, these databases are moving targets and will change over time. As we sequence more species and find novel splice isoforms, among other advances, the databases will grow. They will also be pruned from time to time, as some hypothetical proteins are based on genes that turn out not to code for proteins after all.

In 2007, a study estimated that the Earth's proteome would contain around 5 million sequences, and that the majority of these would be elucidated by 2012. I suspect this is a very thorough study; however, a lot has changed in the last 10 years. This estimate is actually lower than the nearly 9 million species estimated in more recent studies.

# Approximate estimate

So let's do some back-of-the-envelope maths. Let's assume the article estimating nearly 9 million species is about right, and that we've only catalogued some 1.2 million. UniProt isn't even close to this number: it contains 25,477 scientific names in its controlled vocabulary. So for 25 thousand names, we have 85 million protein records. What if we had 8.75 million names? Let's assume:

$$\frac{Predicted~Proteins}{Predicted~Species}=\frac{Known~Proteins}{Known~Species}$$

We can rearrange this to:

$$\frac{Predicted~Species~\times~Known~Proteins}{Known~Species}=Predicted~Proteins$$

Generous estimate (UniProt, about 3,351 proteins per species, i.e. 85,381,808 / 25,477):

$$\frac{8750000 \times 85381808}{25477}=2.932413 \times 10^{10}$$

Conservative estimate (Swissprot, 41 proteins per species):

$$\frac{8750000 \times 554241}{13408}=3.616952 \times 10^{8}$$

For the sake of completeness, let's assume the number of <90% identical proteins will remain around half that value. We can say that there might be around $$1.8 \times 10^{8}$$ to $$1.5 \times 10^{10}$$ "unique" proteins, less than a trillion ($$1 \times 10^{12}$$). Given the generous figure of roughly 3,350 proteins per species and the very stingy 41 proteins per species, we can be fairly certain that if there are indeed 8.75 million species, the number of proteins will fall between those estimates.
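The arithmetic above can be re-run as a short script. This is only a sketch of the linear-scaling assumption stated in the text; the function name `predicted_proteins` is an illustrative choice, and the record counts are the ones quoted above, which will drift as the databases change.

```python
# Back-of-the-envelope scaling: assume proteins scale linearly with species.
PREDICTED_SPECIES = 8_750_000  # species estimate used in the text


def predicted_proteins(known_proteins: int, known_species: int,
                       predicted_species: int = PREDICTED_SPECIES) -> float:
    """Scale known protein records up to the predicted number of species."""
    return predicted_species * known_proteins / known_species


# Counts quoted in the text (database snapshots, not fixed constants).
uniprot = predicted_proteins(85_381_808, 25_477)   # generous estimate
swissprot = predicted_proteins(554_241, 13_408)    # conservative estimate

for label, total in [("UniProt", uniprot), ("Swiss-Prot", swissprot)]:
    per_species = total / PREDICTED_SPECIES
    # Assume, as in the text, that about half the records are <90% identical.
    print(f"{label}: {per_species:.0f} proteins/species, "
          f"{total:.3e} total, ~{total / 2:.1e} non-redundant")
```

Swapping in newer record counts only requires changing the two calls to `predicted_proteins`.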

The biggest assumption here is that the proteins have a linear relationship with species which is unlikely to be the case, and at the generous estimate we are pretending that there are no proteins in UniProt that don't have species annotation. As for Swissprot, this only includes proteins that have been manually curated, so this ignores many proteins that are safe to assume exist and typically only covers proteins that are of interest to scientists.

A minor correction to your question: UniProt lists ~20 thousand protein-coding genes in the human proteome, not millions. Those protein-coding genes are subject to alternative splicing and various post-translational modifications, so there will be more final proteins than 20k.

## How to construct a protein factory

The complexity of molecular structures in the cell is amazing. Having achieved great success in elucidating these structures in recent years, biologists are now taking on the next challenge: to find out more about how they are constructed. A new research project now provides insight into a very unusual construction process in the unicellular parasite Trypanosoma brucei.

Cells consist of a multitude of molecular structures, some of them exhibiting a staggering complexity. Ribosomes, the protein factories of the cell, are among the biggest and most sophisticated complexes and are made up of RNA as well as a large number of proteins. They exist in every living being and are considered to be among the cellular machines that have changed the least over the course of evolution. But there are exceptions: in mitochondria, the cellular organelles that serve as power plants, ribosomes look considerably different.

An extensive machinery

Scientists are not only interested in the structure and function of such ribosomes, but also in the "construction process": how do cells manage the assembly of these complex structures? And how do these construction methods differ for different structures? It is clear that an extensive cellular machinery is needed to guarantee the smooth assembly of all the building blocks. The cellular machinery responsible for ribosome assembly in mitochondria had not been described until now. Researchers from the André Schneider group of the University of Bern and the Nenad Ban group of ETH Zurich investigated the mitochondrial ribosome assembly process using the unicellular parasite Trypanosoma brucei. They were able to follow the construction process and to identify the associated cellular machinery dedicated to assembling these mitoribosomes. Since T. brucei causes diseases that are difficult to treat, including sleeping sickness, the results could lead to new therapies. The project was made possible by the National Center of Competence in Research "RNA & Disease," which studies the role of RNA in disease mechanisms. The findings have now been published in "Science."

Unknown elements in the "construction business"

The parasite Trypanosoma brucei was used as a model system since its mitoribosomes are particularly complex and, therefore, likely to require many assembly steps. The researchers could follow all these steps in detail. "We have found fascinating differences," says Moritz Niemann from the Department of Chemistry and Biochemistry of the University of Bern, co-author. In mitochondrial ribosomes, RNA can be thought of as the steel in reinforced concrete, whereas in other ribosomes it plays a key structural role, as iron does in structures such as the Eiffel Tower. Analysis showed that the assembly of mitoribosomes in T. brucei proceeds through the formation of several assembly intermediates. It also involves a large number of proteins that form a huge adaptive scaffolding around the emerging mitoribosome that is not present in the completed structure. Martin Saurer from the Department of Biology of ETH Zurich, first author, says that many of these proteins were unknown in the "construction business." "Cryo-electron microscopy does not only allow us to visualize known complexes but also to discover and describe an entire cellular process: the construction site and the machinery involved in assembling mitochondrial ribosomes," he adds. Moritz Niemann was especially baffled by the enormous effort the cell is putting into this: "Up to a quarter of all proteins in the mitochondrion are components of the mitoribosomes or are required to build them."

Better understanding leads to new therapies

Since several of the assembly proteins have look-alikes in other organisms, the researchers believe that the obtained insights provide general information for better understanding ribosomal maturation in all organisms. And because all these proteins are essential for the functioning of the cell, these findings could be useful for developing therapies against T. brucei and related parasites that cause many devastating diseases in humans and animals.

## How many proteins are in the Earth's proteome? - Biology

Figure 1: Gallery of proteins. Representative examples of protein size are shown, with examples drawn to illustrate some of the key functional roles they take on. All the proteins in the figure are shown on the same scale to give an impression of their relative sizes. The small red objects shown on some of the molecules are the substrates for the protein of interest. For example, in hexokinase, the substrate is glucose. The handle in ATP synthase is known to exist, but its exact structure was not available and it is thus drawn only schematically. Names in parentheses are the PDB entry IDs of the structures. (Figure courtesy of David Goodsell).

Proteins are often referred to as the workhorses of the cell. An impression of the relative sizes of these different molecular machines can be garnered from the gallery shown in Figure 1. One favorite example is provided by the Rubisco protein shown in the figure that is responsible for atmospheric carbon fixation, literally building the biosphere out of thin air. This molecule, one of the most abundant proteins on Earth, is responsible for extracting about a hundred Gigatons of carbon from the atmosphere each year. This is ≈10 times more than all the carbon dioxide emissions made by humanity from car tailpipes, jet engines, power plants and all of our other fossil-fuel-driven technologies. Yet carbon levels keep on rising globally at alarming rates because this fixed carbon is subsequently reemitted in processes such as respiration. This chemical fixation is carried out by these Rubisco molecules, with a monomeric mass of 55 kDa, fixing CO2 one molecule at a time, with each CO2 having a mass of 0.044 kDa (just another way of writing 44 Da that clarifies the 1000:1 ratio in mass). For another dominant player in our biosphere consider the ATP synthase (MW≈500-600 kDa, BNID 106276), also shown in Figure 1, that decorates our mitochondrial membranes and is responsible for synthesizing the ATP molecules (MW=507 Da) that power much of the chemistry of the cell. These molecular factories churn out so many ATP molecules that all the ATPs produced by the mitochondria in a human body in one day would have nearly as much mass as the body itself. As we discuss in the vignette on “What is the turnover time of metabolites?” the rapid turnover makes this less improbable than it may sound.

Figure 2: A Gallery of homooligomers showing the beautiful symmetry of these common protein complexes. Highlighted in pink are the monomeric subunits making up each oligomer. Figure by David Goodsell.

The size of proteins such as Rubisco and ATP synthase and many others can be measured both geometrically, in terms of how much space they take up, and in terms of their sequence size, as determined by the number of amino acids that are strung together to make the protein. Given that the average amino acid has a molecular mass of about 100 Da, we can easily interconvert between mass and sequence length. For example, the 55 kDa Rubisco monomer has roughly 500 amino acids making up its polypeptide chain. The spatial extent of soluble proteins and their sequence size often exhibit an approximate scaling property where the volume scales linearly with sequence size, and thus the radii or diameters tend to scale as the sequence size to the 1/3 power. A simple rule of thumb for thinking about typical soluble proteins like the Rubisco monomer is that they are 3-6 nm in diameter, as illustrated in Figure 1, which shows not only Rubisco but many other important proteins that make cells work. In roughly half the cases it turns out that proteins function when several identical copies are symmetrically bound to each other, as shown in Figure 2. These are called homo-oligomers to differentiate them from the cases where different protein subunits are bound together, forming the so-called hetero-oligomers. The most common states are the dimer and tetramer (and the non-oligomeric monomers). Homo-oligomers are about twice as common as hetero-oligomers (BNID 109185).
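The two rules of thumb in this paragraph (about 100 Da per amino acid, and diameter scaling as the cube root of sequence length) can be sketched in a few lines of Python. The function names are illustrative, and the 100 Da figure is the rounded value used in the text rather than a precise average.

```python
AVG_RESIDUE_MASS_DA = 100.0  # rule-of-thumb mass per amino acid, as in the text


def length_from_mass(mass_kda: float) -> float:
    """Approximate number of amino acids for a protein of the given mass."""
    return mass_kda * 1000.0 / AVG_RESIDUE_MASS_DA


def mass_from_length(n_residues: float) -> float:
    """Approximate mass in kDa for a chain of n_residues amino acids."""
    return n_residues * AVG_RESIDUE_MASS_DA / 1000.0


def diameter_ratio(length_a: float, length_b: float) -> float:
    """Ratio of diameters, assuming volume scales linearly with length."""
    return (length_a / length_b) ** (1.0 / 3.0)


print(length_from_mass(55))      # Rubisco monomer, 55 kDa
print(mass_from_length(500))     # a 500-residue chain, in kDa
print(diameter_ratio(800, 100))  # an 8x longer chain is only ~2x wider
```

The cube-root scaling is why an eightfold difference in chain length translates into only about a twofold difference in diameter.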

There is an often-surprising size difference between an enzyme and the substrates it works on. For example, in metabolic pathways, the substrates are metabolites which usually have a mass of less than 500 Da while the corresponding enzymes are usually about 100 times heavier. In the glycolysis pathway, small sugar molecules are processed to extract both energy and building blocks for further biosynthesis. This pathway is characterized by a host of protein machines, all of which are much larger than their sugar substrates, with examples shown in the bottom right corner of Figure 1 where we see the relative size of the substrates denoted in red when interacting with their enzymes.

Figure 3: Distribution of protein lengths in E. coli, budding yeast and human HeLa cells. (A) Protein length is calculated in amino acids (AA), based on the coding sequences in the genome. (B) Distributions are drawn after weighting each gene with the protein copy number inferred from mass spectrometry proteomic studies (M. Heinemann in press, M9+glucose LMF de Godoy et al. Nature 455:1251, 2008, defined media T. Geiger et al., Mol. Cell Proteomics 11:M111.014050, 2012). Continuous lines are Gaussian kernel-density estimates for the distributions serving as a guide to the eye.

Table 1: Median length of coding sequences of proteins based on genomes of different species. The entries in this table are based upon a bioinformatic analysis by L. Brocchieri and S. Karlin, Nuc. Acids. Res., 33:3390, 2005, BNID 106444. As discussed in the text, we propose an alternative metric that weights proteins by their abundance as revealed in recent mass spec proteome-wide censuses. The results are not very different from the entries in this table, with eukaryotes being around 400 aa long on average and bacteria about 300 aa long.

Concrete values for the median gene length can be calculated from genome sequences as a bioinformatic exercise. Table 1 reports these values for various organisms showing a trend towards longer protein coding sequences when moving from unicellular to multicellular organisms. In Figure 3 we go beyond mean protein sizes to characterize the full distribution of coding sequence lengths on the genome, reporting values for three model organisms. If our goal was to learn about the spectrum of protein sizes, this definition based on the genomic length might be enough. But when we want to understand the investment in cellular resources that goes into protein synthesis, or to predict the average length of a protein randomly chosen from the cell, we advocate an alternative definition, which has become possible thanks to recent proteome-wide censuses. For these kinds of questions the most abundant proteins should be given a higher statistical weight in calculating the expected protein length. We thus calculate the weighted distribution of protein lengths shown in Figure 3, giving each protein a weight proportional to its copy number. This distribution represents the expected length of a protein randomly fished out of the cell rather than randomly fished out of the genome. The distributions that emerge from this proteome-centered approach depend on the specific growth conditions of the cell. In this book, we chose to use as a simple rule of thumb for the length of the “typical” protein in prokaryotes ≈300 aa and in eukaryotes ≈400 aa. The distributions in Figure 3 show this is a reasonable estimate though it might be an overestimate in some cases.
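The difference between fishing a protein out of the genome and fishing one out of the cell can be illustrated with a small sketch. The lengths and copy numbers below are invented purely for illustration (they are not taken from the proteomic datasets cited in Figure 3), and the helper names are hypothetical.

```python
from statistics import mean, median

# Hypothetical toy proteome: (length in aa, copies per cell). Illustrative only.
proteins = [(100, 50_000), (250, 20_000), (300, 10_000), (450, 2_000), (1200, 100)]
lengths = [length for length, _ in proteins]
copies = [count for _, count in proteins]


def weighted_mean(values, weights):
    """Mean of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2
    running = 0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value


print("genome-based mean / median:", mean(lengths), median(lengths))
print("abundance-weighted mean / median:",
      round(weighted_mean(lengths, copies), 1), weighted_median(lengths, copies))
```

In this toy example the short proteins are the most abundant, so the abundance-weighted values come out well below the genome-based ones; with real data the shift depends on the organism and growth conditions.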

One of the charms of biology is that evolution necessitates very diverse functional elements creating outliers in almost any property (which is also the reason we discussed medians and not averages above). When it comes to protein size, titin is a whopper of an exception. Titin is a multi-functional protein that behaves as a nonlinear spring in human muscles with its many domains unfolding and refolding in the presence of forces and giving muscles their elasticity. Titin is about 100 times longer than the average protein with its 33,423 aa polypeptide chain (BNID 101653). Identifying the smallest proteins in the genome is still controversial, but short ribosomal proteins of about 100 aa are common.

It is very common to use GFP tagging of proteins in order to study everything from their localization to their interactions. Armed with the knowledge of the characteristic size of a protein, we are now prepared to revisit the seemingly innocuous act of labeling a protein. GFP is 238 aa long, composed of a beta barrel within which key amino acids form the fluorescent chromophore, as discussed in the vignette on “What is the maturation time for fluorescent proteins?”. As a result, for many proteins the act of labeling should really be thought of as the creation of a protein complex that is now twice as large as the original unperturbed protein.

## What is a protein? A biologist explains

Just 20 amino acids form chains in various combinations to create the thousands of varieties of proteins in our body. Credit: David Goodsell/ProteinDatabase, CC BY-SA

Editor's note: Nathan Ahlgren is a professor of biology at Clark University. In this interview, he explains exactly what proteins are, how they are made, and the wide variety of functions they perform in the human body.

A protein is a basic structure that is found in all of life. It's a molecule. And the key thing about a protein is it's made up of smaller components, called amino acids. I like to think of them as a string of different colored beads. Each bead would represent an amino acid, which is a smaller molecule containing carbon, oxygen, hydrogen, nitrogen and sometimes sulfur atoms. So a protein essentially is a string that's made up of these little individual amino acids. There are 22 different amino acids that you can combine in any kind of different way.

A protein doesn't usually exist as a string, but actually folds up into a particular shape, depending on the order and how those different amino acids interact together. That shape influences what the protein does in our body.

Where do the amino acids come from?

The amino acids in our body come from the food we eat. We also make them in our body. For example, other animals make proteins and we eat those. Our bodies take that chain and break it down into the individual amino acids. Then it can remake them into any protein that we need.

Once the proteins are broken down into amino acids in the digestive system, they are taken to our cells and kind of float around inside the cell, as those little individual beads in our analogy. And then inside the cell, your body basically connects them together to make the proteins that your body needs to make.

We can make about half of the amino acids we need on our own, but we have to get the others from our food.

What do proteins do in our body?

Scientists are not exactly sure, but most agree that there are around 20,000 different proteins in our body. Some studies suggest that there might be even more. They carry out a variety of functions from doing some metabolic conversions to holding your cells together to causing your muscles to work.

Their functions fall into a few broad categories. One is structural. Your body is made up of many different kinds of structures—think of stringlike structures, globules, anchors, etc. They form the stuff that holds your body together. Collagen is a protein that gives structure to your skin, bones and even teeth. Integrin is a protein that makes flexible linkages between your cells. Your hair and nails are made of a protein called keratin.

Another big role that they take on is biochemistry—how your body carries out particular reactions in your cell, like breaking down fat or amino acids. Remember when I said our body breaks down the protein from the food that we eat? Even that function is carried out by proteins like pepsin. Another example is hemoglobin – the protein that carries oxygen around in your blood. So they're carrying out these special chemical reactions inside yourself.

Proteins can also process signals and information, like circadian clock proteins which keep time in our cells, but those are a few main categories of functions that proteins carry out in the cell.

Why is protein often associated with muscles and meat?

Different types of foods have different kinds of protein content. There are a lot of carbohydrates in plants like wheat and rice, but they are less rich in protein content. But meat in general has more protein content. A lot of protein is required to make the muscles in your body. That's why protein is often associated with eating meat and building muscle, but proteins are really involved in much, much more than that.

## The Institute for Creation Research

Evolution means change, but when we look to the living world, we see no significant change (macroevolution). Consider the troublesome creatures that Darwin labeled "living fossils." These are organisms that were supposedly extinct for many millions of years, only to appear in the twentieth and twenty-first centuries alive and kicking.

Australia is the home of a beetle discovered alive in 1998 but supposed by Darwinists to have been extinct for "200 million years." It hasn't changed at all. There are dragonfly fossils over "300 million years old" with wing venation virtually identical to dragonfly wings today. There is no change. Millipedes supposedly have been crawling around for "420 million years"! A fascinating plant example is the "150 million-year-old" Wollemi Pine discovered in 1994 and 2000 west of Sydney, Australia. One must ask: Is it logical to assume that a stand of trees can stay in one physical location for over 150 million years and not come to any demise? Those fortunate researchers obtaining permits to visit the secret location of the Wollemi Pine stands must first change clothes to avoid possible contamination of the trees with foreign bacteria, viruses, or spores. But why worry about contagion? The Wollemi Pine should be unbelievably hardy after all those millions of years. Random air currents and rains over the millennia should bring every kind of "bug" to infect these trees and their ancestors a hundred thousand times over.

Another young-earth indicator involves the degradation of organic compounds (i.e., protein) in a geological environment. There's no question, even among some evolutionary naturalists, that unmineralized dinosaur bone still containing bone protein resides in many locations throughout the world. 1 This is amazing, and destroys the mantra of dinosaurs becoming extinct "65 million years ago." Simply put, bone containing such well preserved protein could not possibly have existed for more than a few thousand years in the geological settings in which they are found.

In August of 2004 the BBC News reported the North Greenland Ice Core Project (NGrip) recovering what appears to be blades of grass or pine needles from cores two miles below the surface. While allegedly several million years old, the possible organic matter suggests the Greenland ice sheet formed quickly. 2

Creation biologists view the age of the earth in terms of only thousands of years. Living fossils, ancient "plant matter," and dinosaur protein are not a problem if the earth is young. Evolutionists on the other hand, must posit impossibly long periods of no evolutionary change when an "extinct" creature appears alive, and then make excuses why the creature never changed.

## Amino acid sequences influence a protein’s chemical properties

Sanger’s discovery with insulin revealed not just how proteins have defined chemical structures, but also why different proteins have different functions. Just as different letters of the alphabet have different sounds, the various R chains give the twenty amino acids different chemical properties. Thus, stringing amino acids together in different combinations leads to proteins with extremely diverse properties and shapes.

Sanger’s insulin research acted as a springboard for work by other protein chemists during the 1950s and 60s involving how structure relates to function. By passing X-rays through various proteins, researchers obtained images of their 3-dimensional structures. Studying the images and working out issues related to the physics of chemical bonds, biochemists of the mid 20th century learned that the amino acid sequence represents protein structure on just one level. They started referring to the sequence as the primary structure, since it leads the protein chain to twist and bend in ways that give the protein a more complex shape.

Certain amino acids enable a polypeptide chain to bend, for example, while other amino acids hold the chain more rigid (Figure 9). Some R chains are very hydrophilic (they like being in water) and thus make the amino acid water-soluble. Other R chains are hydrophobic (they don’t mix with water). Often, having a hydrophobic area, or “pocket”, within a protein can help the protein do its particular job, for instance grabbing a hydrophobic substrate in order to modify it chemically.

Figure 9: The various protein structures (primary, secondary, tertiary and quaternary).

Depending on their R chains, amino acids also can vary in terms of their acidity and alkalinity. When the environment is neutral (pH 7), the amino acids aspartate and glutamate act as acids, whereas arginine and lysine act as bases, and this too has major implications for a protein’s properties.

One finished paper amino acid.

We have a fun paper folding activity. Remember how proteins are made of building blocks called amino acids, and have their own special shape? Not only do proteins look different, they have different jobs to do inside the cell to keep your body running smoothly.

The protein we made is a channel that sits in the outer cell surface, or membrane, and works like a door that lets certain molecules pass through. Some channels are open all the time while others can be closed depending on signals from the cell or the environment. When the channel is open, other molecules can enter the cell by passing through the hole in the middle.

As you'll discover while building your origami channel, the shape of a protein is very important. If you don't fold your origami amino acids correctly, they wouldn’t fit together to make a protein chain. Or, if you make a mistake joining amino acids together, the finished channel might not be able to open and close correctly.

In nature the same thing can happen. If a protein is the wrong shape it will not work correctly.

Materials: You will need 8 square pieces of paper of the same size.

Tips: The best way to make folds is to lay the paper down on a hard, flat surface, such as a table. It's important to pay attention to the direction of the paper and make sure not to change its orientation when following instructions.

You can find out more about how proteins fold into unique shapes to make and do work inside your body in the Protein Science section.

You can also download and print our Origami Protein Handout (PDF) for step-by-step instructions of how to make your protein channel, or watch this step by step video.

1. Fold a single piece of paper in half diagonally
2. Fold the paper in half diagonally again
3. Your folded paper should look like this
4. Unfold the paper

5. Fold the paper in half
6. Fold the paper in half again
7. Your folded paper should look like this

8. Unfold the top layer of the square halfway
9. Open the top layer of the square and flatten it into a triangle, using the existing creases.
10. Your folded paper should look like this

11. Flip it over
12. Unfold the top layer halfway
13. Open the top layer and flatten it into a triangle, using the existing creases.
14. Your folded paper should look like this

15. Fold the edges of the top layer only into the centerline
16. Your folded paper should look like this
17. Flip it over
18. Fold the edges of the top layer only into the centerline
19. You've now completed one amino acid. Repeat these steps with another piece of paper until you've created a total of eight amino acids.

And, that's it! Once you have amino acids, you are ready to move onto Part 2 to make the protein channel.

## 3.4 Proteins

In this section, you will investigate the following questions:

• What are functions of proteins in cells and tissues?
• What is the relationship between amino acids and proteins?
• What are the four levels of protein organization?
• What is the relationship between protein shape and function?

### Connection for AP ® Courses

Proteins are long chains of different sequences of the 20 amino acids, each of which contains an amino group (-NH2), a carboxyl group (-COOH), and a variable group. (Think of how many protein “words” can be made with 20 amino acid “letters”.) Each amino acid is linked to its neighbor by a peptide bond formed by a dehydration reaction. A long chain of amino acids is known as a polypeptide. Proteins serve many functions in cells. They act as enzymes that catalyze chemical reactions, provide structural support, regulate the passage of substances across the cell membrane, protect against disease, and coordinate cell signaling pathways. Protein structure is organized at four levels: primary, secondary, tertiary, and quaternary. The primary structure is the unique sequence of amino acids. A change in just one amino acid can change protein structure and function. For example, sickle cell anemia results from just one amino acid substitution in a hemoglobin molecule consisting of 574 amino acids. The secondary structure consists of the local folding of the polypeptide by hydrogen bond formation, leading to the α helix and β pleated sheet conformations. In the tertiary structure, various interactions, e.g., hydrogen bonds, ionic bonds, disulfide linkages, and hydrophobic interactions between R groups, contribute to the folding of the polypeptide into different three-dimensional configurations. Most enzymes are of tertiary configuration. If a protein is denatured, that is, if it loses its three-dimensional shape, it may no longer be functional. Environmental conditions such as temperature and pH can denature proteins. Some proteins, such as hemoglobin, are formed from several polypeptides, and the interactions of these subunits form the quaternary structure of proteins.

Information presented and the examples highlighted in the section, support concepts and Learning Objectives outlined in Big Idea 4 of the AP ® Biology Curriculum Framework. The Learning Objectives listed in the Curriculum Framework provide a transparent foundation for the AP ® Biology course, an inquiry-based laboratory experience, instructional activities, and AP ® exam questions. A Learning Objective merges required content with one or more of the seven science practices.

| Big Idea 4 | Biological systems interact, and these systems and their interactions possess complex properties. |
| --- | --- |
| Enduring Understanding 4.A | Interactions within biological systems lead to complex properties. |
| Essential Knowledge 4.A.1 | The subcomponents of biological molecules and their sequence determine the properties of that molecule. |
| Science Practice 7.1 | The student can connect phenomena and models across spatial and temporal scales. |
| Learning Objective 4.1 | The student is able to explain the connection between the sequence and the subcomponents of a biological polymer and its properties. |
| Essential Knowledge 4.A.1 | The subcomponents of biological molecules and their sequence determine the properties of that molecule. |
| Science Practice 1.3 | The student can refine representations and models of natural or man-made phenomena and systems in the domain. |
| Learning Objective 4.2 | The student is able to refine representations and models to explain how the subcomponents of a biological polymer and their sequence determine the properties of that polymer. |
| Essential Knowledge 4.A.1 | The subcomponents of biological molecules and their sequence determine the properties of that molecule. |
| Science Practice 6.1 | The student can justify claims with evidence. |
| Science Practice 6.4 | The student can make claims and predictions about natural phenomena based on scientific theories and models. |
| Learning Objective 4.3 | The student is able to use models to predict and justify that changes in the subcomponents of a biological polymer affect the functionality of the molecules. |

### Teacher Support

Twenty amino acids can be formed into a nearly limitless number of different proteins. The sequence of the amino acids ultimately determines the final configuration of the protein chain, giving the molecule its specific function.

### Teacher Support

Emphasize that proteins have a variety of functions in the body. Table 3.1 contains some examples of these functions. Note that not all enzymes work under the same conditions. Salivary amylase works best at the near-neutral pH of saliva, while pepsin works in the acidic environment of the stomach. Discuss other materials that can be carried by proteins in body fluids in addition to the substances listed for transport in the text. Proteins also carry insoluble lipids in the body and transport charged ions, such as calcium, magnesium, and zinc. Discuss another important structural protein, collagen, which is found throughout the body, including in most connective tissues. Emphasize that not all hormones are proteins and that steroid-based hormones were discussed in the previous section.

At physiological pH, the amino group of an amino acid gains a proton and becomes positively charged. The carboxyl group readily loses a proton, becoming negatively charged. This gives amino acids their amphoteric (zwitterionic) character and makes the compounds soluble in water. The presence of both functional groups also allows dehydration synthesis to join the individual amino acids into a peptide chain.

Protein structure is explained as though it occurs in three to four discrete steps. In reality, the structural changes that result in a functional protein occur on a continuum. As the primary structure emerges from the ribosome, the polypeptide chain goes through changes until the final configuration is achieved. Have the students imagine a strand of spaghetti as it cooks in a clear pot. Initially, the strand is straight (ignore the stiffness for this example). While it cooks, the strand will bend and twist and (again, for this example) fold itself into a loose ball. The resulting strand has a particular shape. Ask the students what types of chemical bonds or forces might affect protein structure. These shapes are dictated by the position of amino acids along the strand. Other forces will complete the folding and maintain the structure.

The Science Practice Challenge Questions contain additional test questions for this section that will help you prepare for the AP exam. These questions address the following standards:
[APLO 1.14] [APLO 2.12] [APLO 4.1] [APLO 4.3] [APLO 4.15] [APLO 4.22]

### Types and Functions of Proteins

Proteins are one of the most abundant organic molecules in living systems and have the most diverse range of functions of all macromolecules. Proteins may be structural, regulatory, contractile, or protective; they may serve in transport, storage, or membranes; or they may be toxins or enzymes. Each cell in a living system may contain thousands of proteins, each with a unique function. Their structures, like their functions, vary greatly. They are all, however, polymers of amino acids, arranged in a linear sequence.

Enzymes, which are produced by living cells, are catalysts in biochemical reactions (like digestion) and are usually complex or conjugated proteins. Each enzyme is specific for the substrate (a reactant that binds to an enzyme) it acts on. The enzyme may help in breakdown, rearrangement, or synthesis reactions. Enzymes that break down their substrates are called catabolic enzymes, enzymes that build more complex molecules from their substrates are called anabolic enzymes, and enzymes that affect the rate of reaction are called catalytic enzymes. It should be noted that all enzymes increase the rate of reaction and, therefore, are considered to be organic catalysts. An example of an enzyme is salivary amylase, which hydrolyzes its substrate amylose, a component of starch.

Hormones are chemical-signaling molecules, usually small proteins or steroids, secreted by endocrine cells that act to control or regulate specific physiological processes, including growth, development, metabolism, and reproduction. For example, insulin is a protein hormone that helps to regulate the blood glucose level. The primary types and functions of proteins are listed in Table 3.1.

| Type | Examples | Functions |
| --- | --- | --- |
| Digestive Enzymes | Amylase, lipase, pepsin, trypsin | Help in digestion of food by catabolizing nutrients into monomeric units |
| Transport | Hemoglobin, albumin | Carry substances in the blood or lymph throughout the body |
| Structural | Actin, tubulin, keratin | Construct different structures, like the cytoskeleton |
| Hormones | Insulin, thyroxine | Coordinate the activity of different body systems |
| Defense | Immunoglobulins | Protect the body from foreign pathogens |
| Contractile | Actin, myosin | Effect muscle contraction |
| Storage | Legume storage proteins, egg white (albumin) | Provide nourishment in early development of the embryo and the seedling |

Proteins have different shapes and molecular weights; some proteins are globular in shape whereas others are fibrous in nature. For example, hemoglobin is a globular protein, but collagen, found in our skin, is a fibrous protein. Protein shape is critical to its function, and this shape is maintained by many different types of chemical bonds. Changes in temperature, pH, and exposure to chemicals may lead to permanent changes in the shape of the protein, leading to loss of function, known as denaturation. All proteins are made up of different arrangements of the most common 20 types of amino acids.

### Amino Acids

Amino acids are the monomers that make up proteins. Each amino acid has the same fundamental structure, which consists of a central carbon atom, also known as the alpha (α) carbon, bonded to an amino group (NH2), a carboxyl group (COOH), and to a hydrogen atom. Every amino acid also has another atom or group of atoms bonded to the central atom known as the R group (Figure 3.24).

The name "amino acid" is derived from the fact that these molecules contain both an amino group and a carboxylic acid group in their basic structure. As mentioned, there are 20 common amino acids present in proteins. Nine of these are considered essential amino acids in humans because the human body cannot produce them and they must be obtained from the diet. For each amino acid, the R group (or side chain) is different (Figure 3.25).

### Visual Connection

1. Polar and charged amino acids will be found on the surface. Non-polar amino acids will be found in the interior.
2. Polar and charged amino acids will be found in the interior. Non-polar amino acids will be found on the surface.
3. Non-polar and uncharged proteins will be found on the surface as well as in the interior.

The chemical nature of the side chain determines the nature of the amino acid (that is, whether it is acidic, basic, polar, or nonpolar). For example, the amino acid glycine has a hydrogen atom as the R group. Amino acids such as valine, methionine, and alanine are nonpolar or hydrophobic in nature, while amino acids such as serine, threonine, and cysteine are polar and have hydrophilic side chains. The side chains of lysine and arginine are positively charged, and therefore these amino acids are also known as basic amino acids. Proline has an R group that is linked to the amino group, forming a ring-like structure. Proline is an exception to the standard structure of an amino acid since its amino group is not separate from the side chain (Figure 3.25).
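The grouping described above can be written as a small lookup table. The sketch below is illustrative only: it covers just the amino acids named in this paragraph, keyed by their standard one-letter codes, and the class labels and the `classify` and `count_classes` helpers are hypothetical names chosen for this example.

```python
# Side-chain classes for the amino acids named in the paragraph above.
# Keys are standard one-letter codes; the labels are illustrative wording.
SIDE_CHAIN_CLASS = {
    "G": "hydrogen R group",               # glycine
    "V": "nonpolar",                       # valine
    "M": "nonpolar",                       # methionine
    "A": "nonpolar",                       # alanine
    "S": "polar",                          # serine
    "T": "polar",                          # threonine
    "C": "polar",                          # cysteine
    "K": "basic",                          # lysine
    "R": "basic",                          # arginine
    "P": "ring linked to amino group",     # proline
}


def classify(sequence):
    """Return the side-chain class of each residue, or 'not listed'."""
    return [SIDE_CHAIN_CLASS.get(residue, "not listed")
            for residue in sequence.upper()]


def count_classes(sequence):
    """Count how many residues in the sequence fall into each class."""
    counts = {}
    for label in classify(sequence):
        counts[label] = counts.get(label, 0) + 1
    return counts


print(classify("GVK"))
print(count_classes("AVMS"))
```

Amino acids not mentioned in the paragraph are reported as "not listed" rather than guessed.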

Amino acids are represented by a single uppercase letter or a three-letter abbreviation. For example, valine is known by the letter V or the three-letter symbol Val. Just as some fatty acids are essential to a diet, some amino acids are necessary as well. They are known as essential amino acids, and in humans they include isoleucine, leucine, and valine. Essential amino acids are those that are necessary for the construction of proteins in the body but are not produced by the body; which amino acids are essential varies from organism to organism.

The sequence and the number of amino acids ultimately determine the protein's shape, size, and function. Each amino acid is attached to another amino acid by a covalent bond, known as a peptide bond, which is formed by a dehydration reaction. The carboxyl group of one amino acid and the amino group of the incoming amino acid combine, releasing a molecule of water. The resulting bond is the peptide bond (Figure 3.26).
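Because each peptide bond releases one water molecule, the mass of a peptide is the sum of its free amino acid masses minus one water per bond. The sketch below works through that arithmetic for a few amino acids. The masses are approximate average values rounded to two decimals, and `peptide_mass` is a hypothetical helper written for this example.

```python
# Approximate average molecular masses (g/mol) of free amino acids.
# Only a few residues are included to keep the example short.
AMINO_ACID_MASS = {
    "G": 75.07,    # glycine
    "A": 89.09,    # alanine
    "S": 105.09,   # serine
    "V": 117.15,   # valine
}
WATER_MASS = 18.02  # g/mol, released once per peptide bond formed


def peptide_mass(sequence):
    """Mass of a linear peptide: free amino acids minus one water per bond."""
    if not sequence:
        return 0.0
    total = sum(AMINO_ACID_MASS[residue] for residue in sequence)
    peptide_bonds = len(sequence) - 1  # n residues are joined by n - 1 bonds
    return total - peptide_bonds * WATER_MASS


# Glycine + alanine: one bond, so one water is lost.
print(round(peptide_mass("GA"), 2))
```

A dipeptide loses one water, a tripeptide two, and so on, which is why a chain of n amino acids always weighs less than n separate amino acids.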

The products formed by such linkages are called peptides. As more amino acids join to this growing chain, the resulting chain is known as a polypeptide. Each polypeptide has a free amino group at one end. This end is called the N terminal, or the amino terminal, and the other end has a free carboxyl group, also known as the C or carboxyl terminal. While the terms polypeptide and protein are sometimes used interchangeably, a polypeptide is technically a polymer of amino acids, whereas the term protein is used for a polypeptide or polypeptides that have combined together, often have bound non-peptide prosthetic groups, have a distinct shape, and have a unique function. After protein synthesis (translation), most proteins are modified. These are known as post-translational modifications. They may undergo cleavage or phosphorylation, or may require the addition of other chemical groups. Only after these modifications is the protein completely functional.


You’ve probably seen it listed on labels. But what exactly is it—and what are the benefits?

Hydrolyzed protein: It’s a complicated name for a complicated process. But its benefits can’t be overstated for people who have trouble digesting whole and conventional food sources of protein.

Protein is a necessary macronutrient in the human diet—one that supports everything from cell function to muscle maintenance and generation. It’s composed of long chains of amino acids, many of which are individually necessary for human health.

But for people with compromised digestive function, separating and absorbing those complex chains of amino acids can be difficult, or even impossible, says Vanessa Carr, M.S., R.D.N., L.D.N., clinical nutrition manager for Kate Farms. That’s where hydrolyzed protein comes in.

### What the heck does “hydrolyzed” mean?

“Basically, it’s the unchaining of long protein strands into smaller chains or single amino acids,” says Carr. This process involves breaking down the peptide bonds that hold amino acids together, and it’s accomplished using enzymes like the ones produced in the human pancreas or other digestive organs.

Protein molecules can be “partially” hydrolyzed, meaning their amino acid chains are cut down into smaller segments, or they can be fully hydrolyzed, meaning every amino acid has been isolated, Carr explains.
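To make the difference concrete, here is a toy sketch that treats a protein as a string of one-letter amino acid codes. It is a simplification: real enzymes cut at specific sites rather than at fixed intervals, and the function names and the example sequence are made up for illustration.

```python
def partial_hydrolysis(sequence, fragment_length):
    """Cut a chain into consecutive fragments of at most fragment_length."""
    if fragment_length < 1:
        raise ValueError("fragment_length must be at least 1")
    return [sequence[i:i + fragment_length]
            for i in range(0, len(sequence), fragment_length)]


def full_hydrolysis(sequence):
    """Isolate every amino acid: each residue becomes its own fragment."""
    return list(sequence)


def bonds_broken(sequence, fragments):
    """Peptide bonds cut: bonds in the intact chain minus bonds left intact."""
    bonds_before = max(len(sequence) - 1, 0)
    bonds_after = sum(len(fragment) - 1 for fragment in fragments)
    return bonds_before - bonds_after


chain = "MKTAYIAKQR"  # hypothetical 10-residue chain
partial = partial_hydrolysis(chain, 4)
print(partial, bonds_broken(chain, partial))
print(full_hydrolysis(chain), bonds_broken(chain, full_hydrolysis(chain)))
```

Partial hydrolysis leaves short peptides with some bonds intact, while full hydrolysis breaks every bond in the chain.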

### Why is this necessary?

Digestion involves breaking down food molecules so the body can put them to good use. But again, for some, that breakdown capability is impaired. Because hydrolyzed proteins are already broken down—basically, pre-digested—the body can absorb them with little to no effort, Carr says.

### Who benefits most from hydrolyzed proteins?

They’re especially important for people who are missing parts of their intestine, along with those who have pancreatic disease or other conditions that make protein digestion a struggle.

“People with malabsorption disorders,” Carr says, “and people with food allergies or sensitivities can usually tolerate a hydrolyzed protein formula better than the others.” It’s used in hypoallergenic infant formulas, for example.

Hydrolyzed proteins may also cause less of an upset stomach for those with gut conditions like irritable bowel syndrome. They have also been shown to benefit those with slow stomach digestion.

Historically, all hydrolyzed proteins were dairy-based. That’s since changed, and Kate Farms has been at the forefront—more on this later.

### How many amino acids are found in protein?

Twenty different amino acids can combine to make protein, says Carr, though a single protein molecule can include a sequence of 200 or more single amino acids in various combinations.

Think of a protein molecule as a train. A big train can have 200 cars, made up of 20 types of cars.
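To get a feel for how many different "trains" that allows, here is a quick back-of-the-envelope count. It assumes each of the 200 positions can independently hold any of the 20 amino acids, which counts every possible sequence whether or not it occurs in nature. The `possible_sequences` helper is a name made up for this example.

```python
AMINO_ACID_TYPES = 20


def possible_sequences(length, types=AMINO_ACID_TYPES):
    """Number of distinct chains: one independent choice per position."""
    return types ** length


count = possible_sequences(200)
# Python integers have arbitrary precision, so the exact value is available.
print(f"20^200 has {len(str(count))} digits")
```

Even a two-car train already has 400 possible arrangements, and the count grows by a factor of 20 with every car added.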

### Are all of them essential?

Nope. Only nine of them are essential, which means the human body cannot produce them on its own.

Ever hear that some foods are “complete” proteins? That means the food contains all nine essential amino acids in the right proportion. While it’s true that animal sources of protein are complete, Carr says, it isn’t necessary to eat only animal products to get all the essential amino acids.

### What makes Kate Farms’ hydrolyzed protein different from everyone else’s?

Kate Farms Peptide Formulas contain hydrolyzed pea protein, supplemented with other plant-based amino acids. “We’re the first company to hydrolyze a complete plant-based protein,” Carr says. “Every other hydrolyzed formula—every single one—uses whey, which is from dairy.”

Carr says some people with dairy allergies (or intolerances) can benefit from a hydrolyzed whey protein, but they may experience symptoms when exposed to whey-sourced amino acids. “All Kate Farms products are vegan,” she says, “so there’s no animal sourcing whatsoever.”

In addition, pea protein is not one of the top 8 food allergens. Kate Farms shakes are free of corn as well, unlike the single amino acid formulas on the market.


## Machine-learning model helps determine protein structures


Cryo-electron microscopy (cryo-EM) allows scientists to produce high-resolution, three-dimensional images of tiny molecules such as proteins. This technique works best for imaging proteins that exist in only one conformation, but MIT researchers have now developed a machine-learning algorithm that helps them identify multiple possible structures that a protein can take.

Some AI techniques aim to predict protein structure from sequence data alone. Cryo-EM instead determines structure experimentally, producing hundreds of thousands, or even millions, of two-dimensional images of protein samples frozen in a thin layer of ice. Computer algorithms then piece together these images, taken from different angles, into a three-dimensional representation of the protein in a process termed reconstruction.
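As a rough illustration of why images from different angles carry complementary information, the toy sketch below sums a tiny 3D grid of density values along each axis to produce 2D projections. This is only the forward direction under strong simplifying assumptions (no noise, three known axis-aligned views). It is not the researchers' method, and real reconstruction has to solve the much harder inverse problem. The `project` function and the example grid are hypothetical.

```python
def project(volume, axis):
    """Sum a 3D grid indexed as volume[z][y][x] along one axis into a 2D image."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    if axis == "z":  # view from above: image[y][x]
        return [[sum(volume[z][y][x] for z in range(nz)) for x in range(nx)]
                for y in range(ny)]
    if axis == "y":  # view from the front: image[z][x]
        return [[sum(volume[z][y][x] for y in range(ny)) for x in range(nx)]
                for z in range(nz)]
    if axis == "x":  # view from the side: image[z][y]
        return [[sum(volume[z][y][x] for x in range(nx)) for y in range(ny)]
                for z in range(nz)]
    raise ValueError("axis must be 'x', 'y', or 'z'")


# A made-up 2 x 2 x 2 density grid standing in for a molecule.
toy_volume = [
    [[1, 0], [0, 0]],  # z = 0
    [[1, 1], [0, 0]],  # z = 1
]

for axis in ("z", "y", "x"):
    print(axis, project(toy_volume, axis))
```

Each view loses the depth information along its own axis, which is why many views are needed to recover the 3D shape.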

In a Nature Methods paper, the MIT researchers report a new AI-based software for reconstructing multiple structures and motions of the imaged protein — a major goal in the protein science community. Instead of using the traditional representation of protein structure as electron-scattering intensities on a 3D lattice, which is impractical for modeling multiple structures, the researchers introduced a new neural network architecture that can efficiently generate the full ensemble of structures in a single model.

“With the broad representation power of neural networks, we can extract structural information from noisy images and visualize detailed movements of macromolecular machines,” says Ellen Zhong, an MIT graduate student and the lead author of the paper.

With their software, they discovered protein motions from imaging datasets where only a single static 3D structure was originally identified. They also visualized large-scale flexible motions of the spliceosome — a protein complex that coordinates the splicing of the protein coding sequences of transcribed RNA.

“Our idea was to try to use machine-learning techniques to better capture the underlying structural heterogeneity, and to allow us to inspect the variety of structural states that are present in a sample,” says Joseph Davis, the Whitehead Career Development Assistant Professor in MIT’s Department of Biology.

Davis and Bonnie Berger, the Simons Professor of Mathematics at MIT and head of the Computation and Biology group at the Computer Science and Artificial Intelligence Laboratory, are the senior authors of the study, which appears today in Nature Methods. MIT postdoc Tristan Bepler is also an author of the paper.

### Visualizing a multistep process

The researchers demonstrated the utility of their new approach by analyzing structures that form during the process of assembling ribosomes — the cell organelles responsible for reading messenger RNA and translating it into proteins. Davis began studying the structure of ribosomes while a postdoc at the Scripps Research Institute. Ribosomes have two major subunits, each of which contains many individual proteins that are assembled in a multistep process.

To study the steps of ribosome assembly in detail, Davis stalled the process at different points and then took electron microscope images of the resulting structures. At some points, blocking assembly resulted in accumulation of just a single structure, suggesting that there is only one way for that step to occur. However, blocking other points resulted in many different structures, suggesting that the assembly could occur in a variety of ways.

Because some of these experiments generated so many different protein structures, traditional cryo-EM reconstruction tools did not work well to determine what those structures were.

“In general, it’s an extremely challenging problem to try to figure out how many states you have when you have a mixture of particles,” Davis says.

After starting his lab at MIT in 2017, he teamed up with Berger to use machine learning to develop a model that can use the two-dimensional images produced by cryo-EM to generate all of the three-dimensional structures found in the original sample.

In the new Nature Methods study, the researchers demonstrated the power of the technique by using it to identify a new ribosomal state that hadn’t been seen before. Previous studies had suggested that as a ribosome is assembled, large structural elements, which are akin to the foundation for a building, form first. Only after this foundation is formed are the “active sites” of the ribosome, which read messenger RNA and synthesize proteins, added to the structure.

In the new study, however, the researchers found that in a very small subset of ribosomes, about 1 percent, a structure that is normally added at the end actually appears before assembly of the foundation. To account for that, Davis hypothesizes that it might be too energetically expensive for cells to ensure that every single ribosome is assembled in the correct order.

“The cells are likely evolved to find a balance between what they can tolerate, which is maybe a small percentage of these types of potentially deleterious structures, and what it would cost to completely remove them from the assembly pathway,” he says.

### Viral proteins

The researchers are now using this technique to study the coronavirus spike protein, the viral protein that binds to receptors on human cells and allows the virus to enter them. The receptor binding domain (RBD) of the spike protein has three subunits, each of which can point either up or down.
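With three subunits that can each point up or down, there are 2 × 2 × 2 = 8 labeled arrangements. The sketch below simply enumerates them, treating the three subunits as distinguishable and independent. That is an assumption made for illustration, since symmetric arrangements may be structurally equivalent, and the function names are hypothetical.

```python
from itertools import product


def rbd_states(subunits=3):
    """Every up/down arrangement for the given number of subunits."""
    return list(product(("up", "down"), repeat=subunits))


def count_by_up(states):
    """Group arrangements by how many subunits point up."""
    counts = {}
    for state in states:
        ups = state.count("up")
        counts[ups] = counts.get(ups, 0) + 1
    return counts


states = rbd_states()
print(len(states))
print(count_by_up(states))
```

Only one of these arrangements has all three subunits in the "down" state.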

“For me, watching the pandemic unfold over the past year has emphasized how important front-line antiviral drugs will be in battling similar viruses, which are likely to emerge in the future. As we start to think about how one might develop small molecule compounds to force all of the RBDs into the ‘down’ state so that they can’t interact with human cells, understanding exactly what the ‘up’ state looks like and how much conformational flexibility there is will be informative for drug design. We hope our new technique can reveal these sorts of structural details,” Davis says.

The research was funded by the National Science Foundation Graduate Research Fellowship Program, the National Institutes of Health, and the MIT Jameel Clinic for Machine Learning and Health. This work was supported by MIT Satori computation cluster hosted at the MGHPCC.