A New Era in Multistep Enzyme Design
Enzymes are a class of proteins that perform a wide range of functions important to human health, and to life and ecosystems in general. Enzymes perform "catalysis": they bind a substrate, often a small molecule, and speed up a chemical reaction that modifies it in some way. Enzymes are central to many biological functions and industrial processes, and some can catalyze a wide range of reactions. Designing new enzymes is quite tricky though, with nuances involving side chain conformations and medium scale dynamics that need to be considered. Luckily, several new tools have recently been developed which are improving success rates dramatically and reducing the difficulties involved in the design process. In this post, we will have a look at some of those tools, how they could (and should) be used together in a simple enzyme design pipeline, and some of the unique properties of enzymes that need to be considered when designing de novo enzymes.
Motif Scaffolding with RFdiffusion or FrameFlow
Since RFdiffusion, several newer AI models that can perform motif scaffolding have appeared, such as FrameFlow. FrameFlow is slowly catching up to the capabilities of RFdiffusion. It uses the more recent flow matching objective, which ends up being faster at inference and simpler to implement and train. Motif scaffolding lets us choose one or more motifs from a known or predicted protein, then generate a new scaffold that holds them in place. The idea here is to pick functionally important motifs such as active sites and binding sites, and build a new protein scaffold around them so that we can design new enzymes. Doing this is somewhat nuanced though, and involves filtering and ranking designs based on various criteria, which we will discuss.
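To make the motif scaffolding setup a little more concrete, here is a minimal sketch of how one might assemble the contig string that says which residues are fixed motif and which are free scaffold. The `[10-40/A163-181/10-40]` layout follows the motif scaffolding example in the RFdiffusion README as I understand it. The `Motif`, `Linker`, and `build_contig` names are hypothetical helpers written for this post, not part of RFdiffusion or FrameFlow.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Motif:
    """A fixed segment copied from the input structure (hypothetical helper)."""
    chain: str
    start: int
    end: int


@dataclass(frozen=True)
class Linker:
    """New backbone whose length is sampled between min_len and max_len."""
    min_len: int
    max_len: int


def build_contig(segments: List[Union[Motif, Linker]]) -> str:
    """Join segments into a contig string such as [10-40/A163-181/10-40]."""
    parts = []
    for seg in segments:
        if isinstance(seg, Motif):
            if seg.end < seg.start:
                raise ValueError("motif end must not precede its start")
            parts.append(f"{seg.chain}{seg.start}-{seg.end}")
        else:
            if seg.min_len < 0 or seg.max_len < seg.min_len:
                raise ValueError("linker lengths must satisfy 0 <= min <= max")
            parts.append(f"{seg.min_len}-{seg.max_len}")
    return "[" + "/".join(parts) + "]"


# One motif (chain A, residues 163-181) flanked by 10-40 new residues per side.
contig = build_contig([Linker(10, 40), Motif("A", 163, 181), Linker(10, 40)])
print(contig)
```

The resulting string would typically be passed as the contig option when launching inference. The exact invocation depends on the tool and version, so check the relevant README before relying on it.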
RFdiffusion, as many of you may already know, has a "substrate potential", which can be used to discourage clashes with the substrate or with small molecules in general. However, the Results section of Computational design of highly active de novo enzymes says the following:
"For enzyme design, vanilla RFdiffusion implements an auxiliary potential to mimic the physical presence of a substrate. We found that this potential decreased the number of clashes between ligand and backbone, but did not promote the formation of well defined substrate binding pockets, leading to a trade-off between clashes and substrate interactions. For this reason, we added another alpha-helix as an entry-channel placeholder to each artificial motif at the position of the binding pocket prior to motif scaffolding. We then implemented a custom auxiliary potential, which places the center of the denoising trajectory on the helix and enforces a distance constraint of all backbone atoms to this center. The ‘placeholder’ helix is removed after diffusion, leaving a vacant binding pocket. This procedure allowed us to drop the substrate auxiliary potential and enabled diffusion of protein backbones with custom pockets."
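The quote does not give the functional form of the custom potential, so the following is only a guess at what a "distance constraint of all backbone atoms to this center" could look like. It is a flat-bottom harmonic penalty on backbone atoms that stray farther than `max_dist` from a chosen center (for example, the centroid of the placeholder helix). The function name, the 15 Å default, and the quadratic form are all assumptions, not the authors' implementation.

```python
import numpy as np


def center_distance_potential(backbone_xyz, center, max_dist=15.0, weight=1.0):
    """Flat-bottom harmonic penalty keeping backbone atoms near a center.

    Assumed form: E = weight * sum(max(0, |x_i - c| - max_dist)^2).
    Returns the energy and its gradient with respect to the coordinates,
    which is the quantity a guided denoising step would use.
    """
    xyz = np.asarray(backbone_xyz, dtype=float)
    delta = xyz - np.asarray(center, dtype=float)
    dist = np.linalg.norm(delta, axis=1)
    # Only atoms beyond max_dist contribute to the penalty.
    excess = np.clip(dist - max_dist, 0.0, None)
    energy = weight * float(np.sum(excess ** 2))
    # Avoid dividing by zero for an atom sitting exactly on the center.
    safe_dist = np.where(dist > 0.0, dist, 1.0)
    grad = (2.0 * weight * excess / safe_dist)[:, None] * delta
    return energy, grad


atoms = [[10.0, 0.0, 0.0], [18.0, 0.0, 0.0]]
energy, grad = center_distance_potential(atoms, center=[0.0, 0.0, 0.0])
print(energy)
print(grad)
```

Moving atoms against the gradient pulls the outliers back toward the center, while atoms already inside the radius feel no force.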
Sequence Design with LigandMPNN
Typically, at this point, we would like to design a protein sequence that folds into the desired shape. This is often done with models like ProteinMPNN, while holding some of the most evolutionarily conserved residues fixed, such as those directly involved in catalysis. However, ProteinMPNN does not consider non-protein ligands such as small molecules and enzyme substrates. Luckily, a newer model in the same family, LigandMPNN, has been released and takes non-protein ligands into account. Having the ligand as extra context can be very useful for enzyme sequence design.
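As a rough illustration of choosing which residues to hold fixed, the sketch below picks every residue with an atom within a cutoff of the ligand and formats the result as space-separated chain and residue number tokens such as `A7 A10`. As far as I know this is the form LigandMPNN's `--fixed_residues` option expects, but check the README for your version. Using distance to the ligand as a stand-in for "involved in catalysis" is an assumption, as are the 5 Å cutoff and both function names.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]
ResidueKey = Tuple[str, int]  # (chain ID, residue number)


def residues_near_ligand(
    residue_atoms: Dict[ResidueKey, Sequence[Coord]],
    ligand_atoms: Sequence[Coord],
    cutoff: float = 5.0,
) -> List[ResidueKey]:
    """Return residues with at least one atom within `cutoff` of the ligand."""
    near = []
    for key, atoms in residue_atoms.items():
        if any(math.dist(a, lig) <= cutoff for a in atoms for lig in ligand_atoms):
            near.append(key)
    return sorted(near)


def format_fixed_residues(keys: Sequence[ResidueKey]) -> str:
    """Format residues as space-separated tokens such as 'A7 A10'."""
    return " ".join(f"{chain}{number}" for chain, number in keys)


# Toy coordinates: two residues close to the ligand, one far away.
protein = {
    ("A", 10): [(0.0, 0.0, 0.0)],
    ("A", 50): [(20.0, 0.0, 0.0)],
    ("A", 7): [(3.0, 0.0, 0.0)],
}
ligand = [(1.0, 0.0, 0.0)]
print(format_fixed_residues(residues_near_ligand(protein, ligand)))
```

In practice the catalytic residues from the motif would be added to this list explicitly, since they are the ones we most want to preserve.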
Structure Prediction, Validation, and Ranking
At this point we need some way of validating and ranking our designed enzyme sequences. In particular, we need to know whether the structure predicted by a protein folding model like AlphaFold matches the designed structure that RFdiffusion returns after the motif scaffolding step. To do this, we predict the structure of each designed sequence, then compute the RMSD between the designed scaffold from RFdiffusion and the predicted structure. If the RMSD is low, below one or two Ångströms, then the sequence is likely to fold into the desired shape. Additionally, if we keep only sequences with pLDDT higher than about 80, we restrict ourselves to designs that AlphaFold predicts with higher confidence.
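The check described above can be sketched in a few lines of NumPy. The RMSD here is computed after optimal superposition with the Kabsch algorithm, on matched Cα coordinates that are assumed to have been extracted already. The 2 Å and 80 thresholds come from the text. The function names are mine, and a real pipeline would read coordinates and pLDDT values from the predicted structure files.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two matched N x 3 coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")
    # Remove translation by centering both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # The SVD of the covariance matrix gives the optimal rotation.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Flip one axis if needed so the result is a proper rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc @ R.T - Qc
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))


def passes_filter(rmsd, mean_plddt, rmsd_max=2.0, plddt_min=80.0):
    """Keep designs whose prediction is both close to the design and confident."""
    return rmsd <= rmsd_max and mean_plddt >= plddt_min


# Toy check: a rotated and translated copy of the same points has RMSD near zero.
design = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
predicted = design @ rot_z.T + np.array([5.0, -3.0, 2.0])
print(kabsch_rmsd(design, predicted))
print(passes_filter(rmsd=1.2, mean_plddt=85.0))
```

For enzymes it can also be worth computing a second RMSD restricted to the motif residues, since a good global fit can hide a distorted active site.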
Using ChemNet to Filter Side Chain Ensembles
The previous step, while often used by protein engineers, is generally insufficient to filter and rank our enzyme sequence designs. To get a better filtration and ranking, we need to consider more than just the position of the backbone Cα atoms. We also need to consider the side chains. To do so, the Baker Lab recently trained a new model called ChemNet and introduced it in the paper Modeling protein-small molecule conformational ensembles with ChemNet.
"Figure 1. Overview of ChemNet. A) ChemNet is a denoising neural network which takes at input a partially corrupted protein structure and the chemical structure (but not the coordinates) of any interacting molecules, and predicts the all-atom structure of the complex, as well as the uncertainties in the atom positions in the generated model. B) ChemNet can be used for a wide range of tasks including docking of small molecules and metals to a protein target, modeling non-standard residues, and predicting side chain conformations of amino acid residues and nucleotides at the protein-DNA interface. Shown are x-ray structures (in gray) superimposed with ChemNet models (in blue and orange). C) At input, the molecular system is represented by an annotated graph where nodes are individual atoms and edges are chemical bonds between atoms. Information about chiral centers is supplied to the network as (O,A,B,C) tuples where O is the central pyramidal or tetrahedral atom and its neighboring atoms A,B,C are ordered clockwise. D) ChemNet is a three-track network that iteratively updates 1D and 2D embeddings and the 3D structure, producing at each iteration a refined atomic structure model and estimating uncertainties in atom placements. E) The triple product [...] of the three unit vectors pointing from the central chiral atom to its neighbors (gray arrows) is a pseudoscalar that differs in sign for the R and S configurations: for ideal tetrahedral geometry [...]. By comparing [...] in the non-ideal geometry of the modeled structure to the ideal values [...] or [...] and taking gradients w.r.t. atom coordinates, one gets biasing vectors [...] showing the directions in which atoms should be moved around in order to recreate the desired configuration. F) All-atom FAPE is calculated by aligning the model and the reference structures on every three respective bonded atoms [...] and calculating the deviations in atom positions between the aligned structures. [...] is then the mean over all atoms and all superimpositions. Atom-atom distances are clamped at 10Å. G) Assuming that uncertainties in atom positions in the modeled structure are normally distributed, we let the dedicated head of the network predict the variances [...] for every atom in the system to recapitulate actual deviations [...]. These variances are learned during training by maximizing the likelihood [...]."
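Panel E of the caption is easier to follow with a small worked example. The sketch below computes the signed triple product of the three unit vectors from a central atom to three of its neighbors. For ideal tetrahedral geometry its magnitude is 4/(3√3), about 0.77, and swapping two neighbors flips the sign. The notation and function names are mine rather than the paper's, and which sign corresponds to R or S depends on how the neighbors are ordered, so no mapping is claimed here.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]

# Magnitude of the triple product for three directions of an ideal tetrahedron.
IDEAL_TETRAHEDRAL = 4.0 / (3.0 * math.sqrt(3.0))


def _sub(a: Sequence[float], b: Sequence[float]) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _unit(v: Vec) -> Vec:
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("neighbor coincides with the central atom")
    return (v[0] / norm, v[1] / norm, v[2] / norm)


def _cross(a: Vec, b: Vec) -> Vec:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def chirality_triple_product(center, a, b, c) -> float:
    """Signed volume u1 . (u2 x u3) of unit vectors from center to a, b, c."""
    u1, u2, u3 = (_unit(_sub(p, center)) for p in (a, b, c))
    return sum(x * y for x, y in zip(u1, _cross(u2, u3)))


# Three neighbors placed on the vertices of an ideal tetrahedron.
center = (0.0, 0.0, 0.0)
a, b, c = (1.0, 1.0, 1.0), (1.0, -1.0, -1.0), (-1.0, 1.0, -1.0)
print(chirality_triple_product(center, a, b, c))  # positive, ideal magnitude
print(chirality_triple_product(center, a, c, b))  # same magnitude, opposite sign
```

Comparing this value in a modeled structure against the ideal value with the desired sign gives a simple measure of whether a chiral center has the intended configuration.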
Now, it is unclear whether something akin to the FAFE loss, first used in FAFE: Immune Complex Modeling with Geodesic Distance Loss on Noisy Group Frames, could be used in place of the all-atom version of FAPE used in ChemNet. Still, the results of simple LoRA finetuning of AlphaFold2.3 with FAFE loss instead of FAPE loss are so good that it would be short-sighted not to mention it. If a version of FAFE loss can be substituted for FAPE, and ChemNet is finetuned with LoRA, performance may well improve.
Now, using ChemNet, we are able to dock small molecules and substrates to our newly designed enzymes, and subsequently study the ensemble of side chain conformations. Like the dynamics of regions near the active sites, these conformations are important to enzyme function and catalysis. This gives us a second, more robust method of filtering and ranking our designs, which is very important when running high throughput workflows and generating thousands of de novo candidate enzymes.
Sora for Molecules: Comparing Boltzmann Distributions with MDGen and ENCORE
In the paper Conserved conformational dynamics determine enzyme activity, the authors discuss the possibility that enzyme activity, that is, how effective an enzyme is at catalyzing a specific reaction, is encoded in the dynamics of the protein. This is not terribly surprising considering catalysis is a dynamic process, involving complex but very precise motions of the enzyme that do things like cleaving apart substrate molecules and producing simpler products. The thing to note here is that homologous enzymes often exhibit different catalytic rates despite fully conserved active sites and general binding sites. In other words, although the most important regions of the protein for catalysis (the active sites and binding sites) can be exactly the same between two variants of the enzyme, the two variants may have wildly different rates of catalysis. The amount of substrate a particular enzyme variant can chew through in a fixed amount of time might change dramatically if we mutate active sites or binding sites, but residues located far away (in 3D space) from where catalysis takes place are also important. In the paper, the authors mention that residues surrounding the PTP1B active site promote the dynamically coordinated chemistry necessary for the enzyme to function. However, residues distant from the active site also undergo distinct intermediate time scale dynamics, and these dynamics are correlated with catalytic activity, which allows for different catalytic rates between the variants in this family of enzymes.
Because of this, we need a way of analyzing the dynamics of the enzyme. We need to study the enzyme's "Boltzmann distribution", rather than simply studying static, low energy states given in the PDB or by a structure prediction model like AlphaFold. If we are to design enzyme variants, optimize known enzyme variants, or engineer de novo enzymes with some desired catalysis, we will need some way of comparing Boltzmann distributions and Boltzmann weighted ensembles. We need something that takes the place of RMSD (for comparing static structures), and instead compares two distributions. Now, at this point something like KL-divergence or Jensen-Shannon divergence should be coming to mind.
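As a small illustration of comparing two ensembles rather than two structures, the sketch below histograms a single collective variable (say, a distance or a dihedral measured in every ensemble member) for each ensemble and computes the Jensen-Shannon divergence between the two histograms. Reducing an ensemble to one variable is a simplifying assumption, and the function names are mine. With base 2 logarithms the result lies between 0 (identical distributions) and 1 (no overlap).

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], lo: float, hi: float, n_bins: int) -> List[float]:
    """Normalized histogram of values falling inside [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi lands in the last bin.
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the histogram range")
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits; bins with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric, and finite even without overlap."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy ensembles: the same collective variable sampled from two designs.
ensemble_a = [0.10, 0.15, 0.20, 0.80]
ensemble_b = [0.70, 0.75, 0.90, 0.95]
p = histogram(ensemble_a, 0.0, 1.0, n_bins=4)
q = histogram(ensemble_b, 0.0, 1.0, n_bins=4)
print(js_divergence(p, q))
```

Unlike KL-divergence, the Jensen-Shannon divergence is symmetric and stays finite when one ensemble visits states the other never does, which makes it a more forgiving choice for finite samples.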
Now, there are multiple models that can be used to model the Boltzmann distribution. One that I've written about before is Distributional Graphormer (DiG). DiG, while useful, only models the N, Cα, and C atoms of the backbone, without the backbone oxygen atoms or side chains. This is sufficient if we simply want to sample the Boltzmann distribution of the backbone and obtain a Boltzmann weighted ensemble. We can take the output from RFdiffusion and, while designing sequences for that output, also pack the side chains with LigandMPNN. We then provide this structure to DiG, and in a separate run we provide the predicted structure for the sequence in question. Finally, we compare the two Boltzmann weighted ensembles using methods like those established in ENCORE.
Another approach involves generating actual trajectories, similar to the way movies are generated by Sora, using the recently released MDGen. MDGen is a new generative model out of MIT that can perform several related tasks. It is essentially Sora for molecules, conditioned on physics instead of text (but there's no reason we can't condition on text too!).
MDGen has the following functionalities:
- Forward simulation—given the initial frame of a trajectory, we sample a potential time evolution of the molecular system.
- Interpolation—given the frames at the two endpoints of a trajectory, we sample a plausible path connecting the two. In chemistry, this is known as transition path sampling and is important for studying reactions and conformational transitions.
- Upsampling—given a trajectory with timestep ∆t between frames, we upsample the "framerate" by a factor of M to obtain a trajectory with timestep ∆t/M. This infers fast motions from trajectories saved at less frequent intervals.
- Inpainting—given part of a molecule and its trajectory, we generate the rest of the molecule (and its time evolution) to be consistent with the known part of the trajectory. This ability could be applied to design molecules to scaffold desired dynamics.
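One way to see how the four tasks above relate is to treat a trajectory as a grid of frames by residues and ask which entries are given and which must be generated. The sketch below builds that boolean grid for each task. It is my own reading of the task list, not MDGen's actual interface, and the function name, task labels, and arguments are all hypothetical.

```python
from typing import Iterable, List


def known_mask(
    task: str,
    n_frames: int,
    n_residues: int,
    upsample_factor: int = 1,
    known_residues: Iterable[int] = (),
) -> List[List[bool]]:
    """Return mask[t][r] == True where frame t, residue r is given as input."""
    if n_frames < 1 or n_residues < 1:
        raise ValueError("need at least one frame and one residue")
    mask = [[False] * n_residues for _ in range(n_frames)]
    if task == "forward":
        # Only the initial frame is known.
        mask[0] = [True] * n_residues
    elif task == "interpolation":
        # Both endpoints are known; the path between them is sampled.
        mask[0] = [True] * n_residues
        mask[-1] = [True] * n_residues
    elif task == "upsampling":
        # Every M-th frame of the fine-grained trajectory is known.
        if upsample_factor < 1:
            raise ValueError("upsample_factor must be at least 1")
        for t in range(0, n_frames, upsample_factor):
            mask[t] = [True] * n_residues
    elif task == "inpainting":
        # Some residues are known in every frame; the rest are generated.
        for row in mask:
            for r in set(known_residues):
                row[r] = True
    else:
        raise ValueError(f"unknown task: {task}")
    return mask


# Upsampling by M = 3: frames 0, 3, and 6 are given, the others are inferred.
for row in known_mask("upsampling", n_frames=7, n_residues=2, upsample_factor=3):
    print(row)
```

Seen this way, forward simulation, interpolation, and upsampling differ only in which frames are given, while inpainting fixes part of the molecule across all frames.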
Applications and Concluding Remarks
Using these models and methods, we can develop a powerful and robust enzyme design pipeline. At this point, the only thing holding us back from designing de novo enzymes with completely new to nature catalysis is a simple, generalizable method for constructing "theozymes", that is, 3D arrangements of active sites and binding sites, which can subsequently be grafted into a scaffold using models like RFdiffusion and FrameFlow. Constructing the 3D arrangement of things like active sites, oxyanion holes, binding sites, and the like, i.e. defining the 3D structure of the theozyme and encoding it in a CIF or PDB file, is a nontrivial endeavor. It requires deep knowledge of the multistep catalytic reactions. However, there is some promise that models such as OAReactDiff and EnzymeFlow will help in understanding such reactions and in constructing theozymes.
We will leave it to the reader to investigate further, but the future of enzyme design is looking very promising, with several high quality AI models now trained and available, all providing substantial capabilities to enzyme engineers. With current tools, we can now easily optimize known enzymes, improving their catalysis, function, expression rate, and thermostability. Additionally, we actually have the tools to design completely de novo enzymes from known theozymes, and we even have the tools necessary to do some de novo designs with completely new to nature catalytic reactions.
BioML workflows on the Lilypad Network
Building workflows that aid in multistep enzyme design and other important BioML tasks requires robust infrastructure and tooling! In addition, large amounts of computing resources are needed on demand to run these complex workflows. To help provide solutions and grow the BioML field, the Lilypad network works closely with the BioML research community to provide access to GPU compute and software solutions to build, test, and run ML workflows.
Lilypad is a serverless, distributed compute network that enables internet-scale data processing for AI, ML, and other arbitrary computation. Unleashing idle processing power by leveraging decentralized infrastructure networks, Lilypad unlocks a new marketplace for compute, making AI more accessible, efficient, and transparent for developers and users.