It is sometimes hard to believe that less than a century ago we did not know the secret of biological inheritance. Scientists intuited that traits were transmitted by a slightly acidic substance contained within mysterious, denser cellular structures known as nuclei. This substance was later identified as DNA. It turned out to have complex regulation and a redundant code, described as degenerate because several codons can specify the same amino acid, which allows chemical information to be translated into function. The units of this information are called genes, and their functions are mostly carried out by proteins.
It should be noted that all the protein diversity of known biology has been refined over millions of years, adapting first to geochemical cycles and then to metabolic interactions themselves. This illustrates how difficult it is, scientifically, to improve highly specialized processes that were shaped over time scales unmanageable for humans.
However, protein engineering has been theorized since the 1970s, once the "secret of life" had been decoded. It did not become practical until advances in molecular biology techniques, linked to technological and computing developments, made it possible to digitize the genetic code and to cut and paste regions of sequence, giving rise to what is known as genetic engineering. Today, we see the limitations of this cutting and splicing for protein design. Although this collection of techniques makes it possible to modify the functions of an organism (e.g., by joining domains to transcription factors or other genetic units), it cannot by itself design proteins that carry out a particular process more efficiently than natural proteins. The industrial need to make processes more efficient and cheaper, or to create new functions, is leading synthetic biology to explore the route of synthetic proteins.
The design of anything implies the ability to control an event arising from the application of the designed object. This means that in designing a protein we expect to have control over the function it ideally performs. The problem is that proteins are polymers built from 20 standard building blocks called amino acids, each with its own physicochemical peculiarities, and these can be combined into an astronomically large number of sequences. Thus, hydrophobicity, net charge, spatial conformation and folding, three-dimensional domains, catalytic active sites, and thermodynamic stability all respond to a multiplicity of factors that are difficult to compute. The laboratory of evolution has it easier, since it only needs time and trial and error. In our case, we need a system external to the molecular technique that dictates which amino acids are the most suitable for the function we want to implement or improve. This is a titanic task that requires an interdisciplinary effort across mathematics, biology, physics, chemistry, computation, and artificial intelligence.
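To give a feel for the simplest of these quantities, here is a minimal Python sketch that estimates hydrophobicity and net charge directly from a sequence, and counts how many sequences of a given length exist. It is an illustration under stated assumptions, not a design tool: the hydrophobicity values are the Kyte-Doolittle hydropathy scale, the charge estimate is deliberately crude (it only counts Lys/Arg as +1 and Asp/Glu as -1, ignoring histidine, the termini, and pH), and the example peptide and all function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Average hydropathy of a sequence (positive means more hydrophobic)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def crude_net_charge(seq):
    """Very rough net charge near neutral pH.

    Assumption: Lys/Arg count as +1 and Asp/Glu as -1; histidine,
    the chain termini, and local pKa shifts are ignored.
    """
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def sequence_space_size(length):
    """Number of distinct sequences of this length over 20 amino acids."""
    return 20 ** length


peptide = "MKTAYIAKQR"  # hypothetical 10-residue example
print("GRAVY:", round(gravy(peptide), 2))
print("Crude net charge:", crude_net_charge(peptide))
print("Sequences of this length:", sequence_space_size(len(peptide)))
```

Even these per-residue sums say nothing about folding, domains, or active sites, which depend on how residues interact in three dimensions. That gap is what makes the computation hard.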
Is it even possible to predict folding?
The short answer is yes. The long answer is that predicting an emergent behavior from the amino acid sequence and the characteristics of each residue could be as complex as predicting the thinking of a human being solely from information about each of their neurons. What we are dealing with are mathematical models of how the molecular world works, assumptions that are continually reaffirmed by testing our hypotheses. We must therefore always assume some error in our understanding of the world, and it is pragmatic to accept it. To design proteins, there are many models that we can test while accepting these errors, based on concepts such as affinity for the target structure, sequence space, structural flexibility, or energy functions. For example, the concept of sequence space is very useful in directed evolution, because it gives a spatial representation of the possible sequences for a protein; artificial evolution then consists of "climbing" from our sequence towards a specific "peak", which corresponds to the function we want to enhance or create. Of course, the algorithms behind these models are extremely complex, and the search itself can be understood as an optimization problem. The physicochemical and temporal versatility of proteins (pleiotropy, sometimes caused by allostery) is mainly what makes them so difficult to predict.
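The idea of climbing towards a peak in sequence space can be sketched as a simple hill-climbing loop. The Python example below is a toy under loud assumptions: the fitness function merely counts positions that match a hidden target sequence, standing in for a real experimental assay or energy model, and the names `toy_fitness`, `hill_climb`, and the sequences themselves are hypothetical. Real directed evolution works with populations, noisy measurements, and rugged landscapes with many local peaks.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_fitness(seq, target):
    """Toy fitness: number of positions identical to a hidden target.

    Assumption: stands in for a measured function (activity, binding...).
    """
    return sum(a == b for a, b in zip(seq, target))


def hill_climb(start, fitness, steps=2000, seed=0):
    """Propose random point mutations and keep those that do not lower fitness."""
    rng = random.Random(seed)
    current = start
    best = fitness(current)
    history = [best]  # fitness after each step, to trace the climb
    for _ in range(steps):
        pos = rng.randrange(len(current))
        residue = rng.choice(AMINO_ACIDS)
        candidate = current[:pos] + residue + current[pos + 1:]
        score = fitness(candidate)
        if score >= best:  # accept neutral or improving mutations
            current, best = candidate, score
        history.append(best)
    return current, history


target = "MKTAYIAKQR"   # hypothetical "peak" of the landscape
start = "G" * len(target)  # arbitrary starting sequence
final, history = hill_climb(start, lambda s: toy_fitness(s, target))
print("start fitness:", history[0])
print("final sequence:", final, "fitness:", history[-1])
```

Accepting neutral mutations as well as improving ones lets the search drift across flat regions of the landscape. On a landscape this smooth the climb is easy; the difficulty in practice is that real fitness landscapes are not.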
A distinction must be made between protein engineering and protein design. While the former relies on genetic engineering of already existing sequences, the latter is about designing a protein de novo. This makes it much harder to find examples of design in industry: there are many modified enzymes and proteins, but very few designed ones, and those usually come from the public sector. Recently, Costas D. Maranas developed an application to design an outer membrane porin in E. coli, changing the pore size with angstrom-level precision.
Other examples include the design of protein biosensors, which detect a ligand and emit a signal upon binding or interacting with it, the ligand being whatever compound the sensor was designed for. Another quite ingenious example is what is known as protein resurfacing, which aims to preserve the folds and internal structure of a protein of interest while replacing its most hydrophilic amino acids, i.e. those on the three-dimensional surface of the protein, even though the modified residues are not contiguous in the linear sequence. This makes it possible to recycle folds involved in processes of interest, modifying the properties of the surface residues to enhance or inhibit their action.
Current Research & Startups
The market for this trend, and the expectations around it, have grown rapidly in recent years. Thus, we have initiatives such as Cyrus Biotechnology, a spin-out of David Baker's laboratory, which aims to help researchers around the world with software built on Rosetta. In a more traditional environment, we have Arzeda, a company that focuses on designing enzymes, including enzymes with functions not seen before in nature. The future is promising for this area: never before have biological functions been so amenable to being modified at our whim, without having to wait millions of years.