
After Beating Physics at Modeling Atoms and Molecules, Machine Learning Is Now Collaborating with It

by LucianoSphere | Dec 2022



Photo by Toa Heftiba on Unsplash

Predicting the structures, motions, and reactivity of atoms, molecules, and materials is crucial in modern science, as they are directly related to their properties and behavior, and hence applications. Traditionally, the study of matter at the atomic level has been approached using physics-based methods, which rely on the principles of either classical or quantum mechanics—depending on the level of detail intended and on the questions asked. Whatever the exact type of simulation, in these approaches the goal is always to try to describe reality as accurately as possible and as required to answer the question of interest, by parametrizing this reality in a way that mimics the physics of the studied system including its evolution over time and space.

For example, if one wants to study how the 3D structure of a molecule changes over time without breaking or forming any bonds, i.e. simply through the temperature-induced vibration of atoms whose connectivities don’t change, the technique of choice is some form of atomistic molecular dynamics simulation. Over sufficient time, the fast vibrations of the atoms couple into bond length and bond angle deformations, then dihedral angle rotations, and eventually bigger conformational changes. In biology, such simulations have tons of applications, but in practice they fall very short, for two reasons: perfect parametrization of all the atoms, bonds, and interactions possible in such systems is very difficult, not to say impossible; and the events of interest happen on timescales that are very long compared to the femtosecond timestep used to integrate the equations of motion, so these simulations can today barely explore the fastest events.
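
To make the mechanics concrete, here is a minimal sketch of the core loop that any molecular dynamics engine runs: compute forces, then advance positions and velocities by a small timestep (the velocity Verlet scheme). The two-atom harmonic “bond” and all parameter values are illustrative, not taken from any real force field.

```python
import numpy as np

# Toy velocity-Verlet integrator: two atoms joined by a harmonic "bond".
# All parameters are illustrative, not from any real force field.
k, r0 = 500.0, 1.0           # spring constant and equilibrium bond length
m = 1.0                      # atomic mass (arbitrary units)
dt = 1e-3                    # timestep; real MD uses ~1 fs (1e-15 s)

def forces(x):
    """Forces from a harmonic bond between atoms 0 and 1."""
    d = x[1] - x[0]
    r = np.linalg.norm(d)
    f = -k * (r - r0) * d / r      # force on atom 1
    return np.array([-f, f])       # atom 0 feels the opposite force

x = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])  # slightly stretched bond
v = np.zeros_like(x)
f = forces(x)
for step in range(10000):
    v += 0.5 * dt * f / m          # half-kick
    x += dt * v                    # drift
    f = forces(x)
    v += 0.5 * dt * f / m          # half-kick
```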

At another, much bigger scale, if the question of interest involves, for example, how a protein diffuses next to a membrane, assuming that its shape doesn’t change much over time, then one can run a simulation where multiple atoms are modeled together as coarse beads, a so-called coarse-grained model. The level of granularity goes from a few atoms per bead to whole proteins per bead, again depending on the question being asked. But still, the physics must be described somehow and the equations of motion need to be propagated, so many questions are simply intractable.

On the other side of the spectrum of sizes, one may want to model the distributions and energies of all the electrons, or of a group of electrons, in a molecule or, say, a piece of material. This requires some kind of quantum mechanics calculation, which can answer questions that involve bond formation and breaking, or the transitions that electrons undergo in certain situations. These simulations are even harder to set up, parametrize, and run than atomistic simulations, and today they can reach even smaller size- and time-scales than classical simulations.

As the preceding paragraphs may have suggested, simulations grounded in physics that try to accurately describe reality are very powerful, yet also very limited. Although in principle these methods should allow us to predict any property of any molecule or group of molecules correctly, in practice they often fall very short. To summarize the above explanations, this is mainly for two reasons: (i) these simulations rely on numerous assumptions and approximations, because we cannot model reality exactly as it is, and (ii) they are limited by the available computing power, such that even if reality could be parametrized in exact detail, we still could not simulate it at the size- and time-scales of interest.

As a result, in practice these physics-based methods are very useful only for certain questions in some areas, while they struggle to accurately capture the complexity and subtlety of others. To mention one example close to my research, one would expect that a simulation should be capable of taking an extended protein and folding it into its correct 3D structure. In practice, we are extremely far from this, except for small proteins simulated on very sophisticated machines built to run molecular simulations, as I described here:

The era of AI

In recent years, there has been a growing trend toward the use of artificial intelligence (AI)-based methods for predicting the properties of molecules at various levels of resolution, from electrons in atoms to biomolecules and bulk materials. These methods train machine learning algorithms on vast amounts of data about known molecular structures and properties, and then use what the models have learnt to make predictions about the structures and properties of new molecules and materials, or even to design new ones.

Unlike physics-based methods, these AI-based approaches do not rely on any specific assumptions or approximations such as the parametrization of atoms and their interactions to describe reality. Instead, they rely on the power of big data and modern machine learning algorithms to make accurate predictions.

Coming back to the earlier example about predicting a protein’s structure by folding it with pure physics, this is a perfect example of AI methods smashing physics-based methods. Even before AlphaFold came out in CASP14, the best protein structure prediction programs were already using ML methods to assist their calculations. I’ve discussed in particular how contacts between protein amino acids were predicted already in CASP12 and CASP13, and then used to fold proteins by combining the ML-predicted contacts with partial physics-based descriptions of the proteins. At that time, and even earlier, pure physics-based methods ranked quite low in the CASP rankings. (To learn more about CASP and these technologies, check my articles here.)

Of course, one notable example of the success of AI-based methods in molecular structure prediction is the AlphaFold system developed by DeepMind and specialized for the prediction of (static) protein structures (and I stress static because that’s another important point where AI is now starting to assist physics, as I will discuss later on). AlphaFold 2’s predictions are superb, and boosted a revolution in modern biology and in the further development of new pure-AI methods to predict other protein properties, and now also mixed AI/physics-based methods that I will introduce below.

Overall, the triumph of AI-based methods over physics-based methods in molecular structure prediction highlights the potential of machine learning and artificial intelligence in the fields of chemistry, biology, and materials science. By leveraging the vast amounts of data available, these methods have been able to make highly accurate predictions that were previously thought to be decades out of reach if only traditional physics-based approaches were pursued. As a result, the use of AI-based methods is likely to keep growing in the field of molecular structure prediction, and may lead to significant advances in our understanding of chemical and biological processes.

Physics-based models still wanted, for many reasons

But despite all the wonders of AI-based methods for molecules and the revolution they are fostering, there are many reasons why scientists would still prefer physics-based models. First, AI methods are hardly interpretable: they can work perfectly well yet remain black boxes that prevent any real understanding of why they perform so well. This means not only a lack of fundamental understanding (what physics does the black box know that we don’t?) but also blind trust in a system that could fail inadvertently without users noticing.

Second, a true representation of reality, even if incomplete, means understanding how the relevant forces of nature and properties of spacetime actually work together to make our reality as it is. With such a fundamental understanding rooted in the core of physics, it should be easier to translate the mathematical models that describe the systems of study to different times or places with slightly different properties, effectively extrapolating well outside the training data used for an AI method, while retaining reasonable confidence that the simulations will make reliable predictions. For example, if we ever find life on another planet, it will likely have evolved under different conditions, and hence its biomolecules will likely be totally different… AlphaFold would be useless to study these molecules, but a hypothetical physics-based method for atomistic simulations, sufficiently advanced and fast to be as reliable and useful as AlphaFold currently is for Earth biology, would be perfectly suited to study the biology of that other planet.

Third, even today, there are many problems of physics, chemistry, biology, engineering, etc. (most of their problems, I would say) that AI methods cannot treat at all, because there isn’t enough data to train them! When only small datasets are available relative to the complexity of a problem, properly grounded, explicit physical models can do much more than an AI method.

Maybe someday pure physics-based methods will crack all this and allow us to simulate everything with perfect precision (as a note off scope: regarding this, I recommend J. L. Borges’ On Exactitude in Science). But right now, we are clearly still very far from solving everything from master equations and fundamental laws, constants, and simple parametrizations.

However, there’s hope for the immediate future in AI-based methods coupled with more classical simulation methods, specifically assisting and boosting the hardest calculations. By bringing together the best of both worlds, science is now starting to evolve rapidly, with ML tools filling the voids left by analytical models and regular numerical methods and algorithms, and even contributing to their own development. For example, I described earlier how ML-made predictions help to produce large numbers of high-quality artificial datasets, which human scientists can then distill into analytical models and algorithms, a surprising way in which the computer assists the human intellect:

More related to physics, I also presented how symbolic regression incorporating physical constraints can rediscover known physics and discover new equations useful in physics and engineering:

In what follows, I will present several specific examples of research and tool development that have combined AI-based and physics-based methods, centered around four main topics/cases.

Case 1: Accelerating quantum mechanical calculations

Ma et al., Science Advances 2022, and Kirkpatrick et al., Science 2021
As I also discussed in a dedicated blog entry, Google researchers recently employed AI methods and symbolic regression to dramatically speed up one specific step of quantum mechanics calculations:

DeepMind presented a different solution to a similar problem in quantum calculations, also based on ML methods.

Smith et al., Chemical Science 2017, and subsequent works
Another way ML is being used to assist physics-based calculations is in replacing or complementing the forcefields that describe interactions between particles, mainly in classical simulations. Such simulations describe molecules by treating atoms as soft spheres connected by mathematical descriptions akin to springs that mimic bonds, torsions along dihedral angles, charges that interact electrostatically, and other similar terms. These equations, plus all the parameterizations they need, constitute the so-called forcefields. Given a configuration of atoms in space and their connectivities, and under the equations and parameters of a given forcefield, a computer program can calculate the resulting forces on all atoms and from these propagate Newton’s equations of motion over and over, producing a kind of movie of how the atoms’ positions evolve over time, called a molecular dynamics trajectory.
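
As an illustration of what such terms look like, here are simplified versions of three typical forcefield contributions. The functional forms are the standard textbook ones, but all parameter values below are made up, not taken from AMBER, CHARMM, or any other real parameter set.

```python
import numpy as np

# Illustrative pieces of a classical force-field energy; parameters are
# invented for the example, not from any real parameter set.

def bond_energy(r, k=300.0, r0=1.09):
    """Harmonic spring mimicking a covalent bond of equilibrium length r0."""
    return 0.5 * k * (r - r0) ** 2

def lennard_jones(r, epsilon=0.1, sigma=3.4):
    """Van der Waals attraction/repulsion between two non-bonded atoms."""
    s6 = (sigma / r) ** 6
    return 4 * epsilon * (s6 ** 2 - s6)

def coulomb(r, qi=0.4, qj=-0.4, ke=332.06):
    """Electrostatics between partial charges (kcal/mol, Angstrom units)."""
    return ke * qi * qj / r

# The total potential sums bonded terms over all bonds/angles/dihedrals plus
# non-bonded terms over atom pairs; forces are minus its gradient.
```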

But what if we could calculate these forces in some other way? One possibility is to obtain them from quantum calculations, a technique that does exist and allows one to inspect events such as bond formation and breaking, but that is way too slow to propagate larger-scale motions.

Some modern neural networks provide the accuracy and some of the capabilities of quantum calculation methods, in a way adaptable to classical molecular dynamics simulations. Probably the most advanced such system is ANI, short for ANAKIN-ME (Accurate NeurAl networK engINe for Molecular Energies), developed by the Roitberg and Isayev groups at the University of Florida and Carnegie Mellon University, U.S.A.:

ANI reads in atomic coordinates and symbols and returns energies, from whose gradients along the three dimensions of space one can also obtain the forces acting on atoms. Once forces are obtained, these can potentially be used to optimize molecular geometries and do other kinds of calculations, as the authors show. And potentially, with some adjustments that are still undergoing research and development, these forces could be used to drive molecular dynamics simulations replacing conventional force fields.
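
Assuming the TorchANI package (the PyTorch implementation of ANI released by its developers) is installed, getting energies and forces takes just a few lines; the methane geometry below is approximate and only for illustration:

```python
import torch
import torchani

# Load the pretrained ANI-2x model (covers H, C, N, O, F, Cl, S)
model = torchani.models.ANI2x(periodic_table_index=True)

# A rough methane geometry: atomic numbers plus coordinates in Angstrom
species = torch.tensor([[6, 1, 1, 1, 1]])
coordinates = torch.tensor([[[ 0.00,  0.00,  0.00],
                             [ 0.63,  0.63,  0.63],
                             [-0.63, -0.63,  0.63],
                             [-0.63,  0.63, -0.63],
                             [ 0.63, -0.63, -0.63]]], requires_grad=True)

energy = model((species, coordinates)).energies        # in Hartree
# Forces are minus the gradient of the energy w.r.t. the coordinates
forces = -torch.autograd.grad(energy.sum(), coordinates)[0]
print(energy.item(), forces.shape)
```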

ANI works as a deep neural network trained on data produced for thousands of small organic molecules, each in multiple conformations and under different deformations, using very detailed quantum mechanical calculations of the DFT (density functional theory) type. The neural network itself is relatively simple compared to modern architectures using transformers, attention, diffusion, and other recent elements. But it is remarkably efficient and fast: it can deliver energies and forces with near-DFT precision, around a million times faster than an actual DFT calculation.

To read atoms in, ANI uses a modified version of atomic environment vectors computed from the spatial arrangement of the atoms in the molecular system. Initially, ANI was trained on molecules containing only H, C, O, and N atoms, and was later expanded to include S, F, and Cl, which are very relevant to organic chemistry and biology (in particular, covering around 90% of drug-like molecules).

Through a series of case studies, ANI’s developers show that it is chemically accurate compared to reference DFT calculations on molecular systems much larger than those in the training set (the training molecules were kept small so that millions of DFT calculations could be run to produce all the training data). ANI provides energies and forces acting on atoms with DFT quality but around a million times faster, enabling a range of applications that were simply unthinkable before. For several computational tasks revolving around molecular structures, ANI could potentially replace both quantum calculations and classical force fields. And for molecular simulations, ANI could fill in the gaps, for example to simulate molecules that aren’t parametrized.

Apparently, the roadmap for ANI is to act as a transferable (i.e. general) forcefield that can be used to simulate the mechanics of molecules with DFT accuracy but at the speed of simple neural network propagations. Some of the ANI developers are also involved in the evolution of the AMBER program for molecular dynamics simulations, so we could see ANI incorporated into it in the future.

As another concrete example application, we are using ANI as a component of our VR tool for molecular structure manipulation, to provide realistic physics in interactive molecular simulations at near-DFT level yet swiftly enough to run in real time. You can see it in action in this short video:

Kulichenko et al., J. Phys. Chem. Lett. 2021
This perspective article focuses specifically on the use of ML to develop forcefields for modeling chemical processes and materials, rather than small molecules as in the works described above.

ML-based force fields trained on large data sets of high-quality electronic structure calculations on materials are attractive because of their combination of efficiency and accuracy. The authors highlight the importance of designing high-quality training data sets, and discuss strategies such as active learning and transfer learning to improve the accuracy of the models. They also provide examples of the application of these advances to molecules and materials.
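
To give a flavor of the active-learning strategy discussed in such perspectives, here is a purely schematic loop; every function name below is a hypothetical placeholder, not an API from any specific package:

```python
# Schematic active-learning loop for building an ML force field. All function
# names are hypothetical placeholders illustrating the workflow only.

def active_learning_loop(initial_structures, n_rounds=10):
    dataset = [(s, run_dft(s)) for s in initial_structures]  # expensive labels
    for _ in range(n_rounds):
        model = train_force_field(dataset)
        # Explore cheaply with the current model (e.g., short MD runs)...
        candidates = sample_with_model(model)
        # ...then label only the configurations the model is least sure about,
        # e.g., those where an ensemble of models disagrees the most.
        uncertain = [s for s in candidates
                     if ensemble_disagreement(model, s) > 0.1]
        dataset += [(s, run_dft(s)) for s in uncertain]
    return train_force_field(dataset)
```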

Case 2: Mining protein conformations with AI

Degiacomi, Structure 2019, and Ramaswamy et al., Physical Review X 2021
If molecular simulations are difficult in general, molecular simulations of proteins are particularly challenging. Proteins are rather big molecules that exhibit substantial flexibility and dynamics; they feature internal contacts that are rather hydrophobic (akin to the contacts between oil molecules in an oil droplet) but also engage in interactions with water at their surfaces. Besides, their stability and structure are modulated by electrostatic interactions, aromatic stacking, hydrogen bonding between protein atoms and with the solvent, etc. That is, they are chemically and physically very complex, despite the apparent simplicity of their constituents. Simulating how proteins fold, diffuse, interact, change structure, or react is thus extremely difficult. More so because, even if we had a perfect forcefield, the events of practical interest usually lie on timescales of tens of microseconds to milliseconds and even longer, while current atomically detailed simulations cannot reach more than a few microseconds with conventional hardware and software.

Just as we saw above, ML-based methods could potentially help to speed up these simulations dramatically. Among the first works exploring this are two papers by the Degiacomi group at Durham University, UK.

First, the researchers demonstrated that a neural network trained on protein structures produced by MD simulations can be used to generate new, plausible protein conformations. They showed that this approach can be used in protein-protein docking scenarios, where the ability to account for flexibility helps to capture the broad hinge motions that occur in proteins upon binding, and that take very long computation times to emerge in regular atomistic simulations.

This network is trained on a collection of alternative protein conformations produced through regular atomistic simulations, and tested on an independent set of conformations not used for training. The network consists of an encoder that takes in the atomic coordinates and passes them through a series of hidden layers with decreasing numbers of neurons, to produce a low-dimensional representation of the input conformation. This signal then proceeds to the decoder, another series of hidden layers, this time with increasing numbers of neurons, which expands it back into an output that should be similar to the initial protein structure passed through the encoder. The whole autoencoder is initially trained to encode-decode structures so that the difference between input and output conformations is minimized. But after training, the decoder can be used to generate new protein structures from any coordinate within the latent space.
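
A minimal PyTorch sketch of such an autoencoder over flattened atomic coordinates might look as follows; the layer sizes and latent dimension are illustrative, and the published network differs in its details:

```python
import torch
import torch.nn as nn

# Minimal autoencoder over flattened atomic coordinates. Layer sizes and the
# latent dimension are illustrative, not those of the published network.
n_atoms = 500
dim_in = 3 * n_atoms   # x, y, z per atom
latent = 2             # low-dimensional conformational space

encoder = nn.Sequential(
    nn.Linear(dim_in, 512), nn.ReLU(),
    nn.Linear(512, 64), nn.ReLU(),
    nn.Linear(64, latent),
)
decoder = nn.Sequential(
    nn.Linear(latent, 64), nn.ReLU(),
    nn.Linear(64, 512), nn.ReLU(),
    nn.Linear(512, dim_in),
)

def reconstruction_loss(x):
    """Train so the decoded output matches the input conformation."""
    return ((decoder(encoder(x)) - x) ** 2).mean()

# After training, new conformations come from decoding arbitrary latent points:
z = torch.randn(1, latent)
new_conformation = decoder(z).reshape(n_atoms, 3)
```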

In subsequent work, the same group developed a convolutional neural network that learns not only protein conformations but also how they continuously interconvert, i.e. the so-called conformational space. Moreover, this network used loss functions that ensure that the intermediates between conformations are physically plausible. This way, after training on snapshots from regular atomistic simulations, the network can predict how a given protein conformation transitions into another, i.e. how it moves in a physically reasonable way.

At its core, this network also includes an encoder and a decoder that, just like in the previous work, compress and decompress the information, creating structural fluctuations along the way. But in between the encoder and the decoder, this network adds a module whose loss function contains physics-based terms that ensure that latent-space interpolations between any pair of conformations produce protein structures of low energy. The physics-based loss function was built from one of the AMBER force fields for classical atomistic simulations (more specifically AMBER ff14SB), taking its bonded and non-bonded terms separately. By using this loss, the internal module of the network enforces physics at all points along the geodesic connecting the input and output protein conformations, constraining the intermediates to follow a locally minimum energy path.
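
Schematically, and reusing the encoder and decoder sketched above, the idea could be expressed like this; amber_energy is a hypothetical stand-in for an evaluation of a force field’s bonded plus non-bonded terms, not an actual API:

```python
import torch

# `encoder` and `decoder` are the networks from the previous sketch;
# `amber_energy(x)` is a HYPOTHETICAL stand-in for evaluating the bonded +
# non-bonded terms of a classical force field (e.g. ff14SB) on a structure.
def interpolation_energy_loss(x_a, x_b, n_points=10):
    z_a, z_b = encoder(x_a), encoder(x_b)
    loss = 0.0
    for t in torch.linspace(0.0, 1.0, n_points):
        x_t = decoder((1 - t) * z_a + t * z_b)  # intermediate structure
        loss = loss + amber_energy(x_t)         # low energy => plausible path
    return loss / n_points
```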

One important point is that since the network can transfer features learned from one protein to others, this architecture could eventually grow into a transferable ML-based forcefield for the simulation of protein dynamics. Achieving this would open up tons of opportunities in structural biology and protein modeling.

Majewski et al., arXiv 2022
Still with proteins but at a lower level of resolution, Majewski et al. just released a preprint describing coarse-grained molecular potentials based on artificial neural networks and grounded in statistical mechanics. They trained this neural network with 9 ms of cumulative atomistic simulations (mostly produced with this specialized computer) for twelve proteins of various structures. The resulting coarse-grained models accelerate the sampling of conformational dynamics by more than three orders of magnitude while preserving the thermodynamics of the systems as provided by the detailed atomistic simulations used for training.
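
A common training strategy in this line of work is force matching: map the all-atom frames to coarse beads (for example, one bead per residue at the C-alpha atom) and train the coarse-grained network so that its forces reproduce the reference atomistic forces. The sketch below is schematic, with cg_net a placeholder for any network mapping bead coordinates to a scalar energy:

```python
import torch

# Schematic force-matching setup for a coarse-grained neural potential.
# `cg_net` is a placeholder for any network: bead coordinates -> energy.

def calpha_map(coords, calpha_idx):
    """Coarse-grain an all-atom frame: keep one bead per residue (C-alpha)."""
    return coords[calpha_idx]

def force_matching_loss(cg_net, bead_coords, ref_forces):
    """Match CG forces (minus the energy gradient) to reference forces."""
    bead_coords = bead_coords.clone().requires_grad_(True)
    energy = cg_net(bead_coords)
    forces = -torch.autograd.grad(energy.sum(), bead_coords,
                                  create_graph=True)[0]
    return ((forces - ref_forces) ** 2).mean()
```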

Working toward a transferable ML-based forcefield, the authors further show that a single coarse-grained potential can be trained that correctly models the dynamics of all twelve proteins, and even captures certain special features of mutated versions of these proteins.

Case 3: ML methods to find reaction coordinates for free energy computation and enhanced sampling

Replacing forcefields or assisting physical calculations is not the only way ML is assisting physics. A whole territory being explored is how ML methods can chart conformational landscapes and construct routes for simulations to navigate them efficiently. The earliest methods to do this involved mathematics like Principal Component Analysis (PCA), which is powerful and fast but limited by its linear nature. Modern ML tools, being far more versatile, can be much more helpful.
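
For context, this is how PCA is typically applied to a trajectory, assuming scikit-learn is available; each frame is flattened into a vector of coordinates and projected onto a couple of linear collective coordinates (the array below is placeholder data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Trajectory as an array of snapshots: (n_frames, n_atoms * 3), assumed
# already aligned to a reference to remove global rotation/translation.
traj = np.random.rand(1000, 3 * 100)  # placeholder data, not a real protein

pca = PCA(n_components=2)
projection = pca.fit_transform(traj)   # each frame -> 2 collective coordinates

# The principal components are fixed linear combinations of coordinates, so
# motions that curve through conformational space are poorly captured; this
# is exactly where nonlinear ML methods step in.
```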

A recent review authored by several big groups in the field has explored this kind of application of ML to physics, together with some of the applications for boosting mechanics calculations that I covered earlier. The review discusses the use of machine learning techniques in molecular dynamics simulations for the construction of empirical force fields and the determination of reaction coordinates for free energy computation and enhanced sampling. These techniques have the potential to extract valuable information from the large amounts of data generated by the simulation of complex systems, but they also have limitations that should be considered. The focus of the review is on the application of these techniques to materials and biological systems.

Case 4: ML adaptations to better model long-range physics

One important limitation of most ML methods dealing with molecules and materials is that their predictions usually build on rather local effects, say within a radius of X Angstrom of an atom, or up to n bonds away, both typically small compared to the total sizes of the systems studied. To arrive at complete predictions, the locally calculated effects are often pooled together, and in this process long-range information might get lost.
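
The typical atom-centered additive construction looks like this in schematic form; local_descriptor is a hypothetical placeholder for, say, symmetry functions restricted to a cutoff radius, and the descriptor size of 64 is arbitrary:

```python
import torch
import torch.nn as nn

# Schematic atom-centered additive model: total energy as a sum of per-atom
# contributions computed only from local environments. `local_descriptor` is
# a hypothetical placeholder returning a 64-dimensional vector per atom.
per_atom_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))

def total_energy(coordinates, cutoff=5.0):
    contributions = []
    for i in range(coordinates.shape[0]):
        d = local_descriptor(coordinates, i, cutoff)  # sees only nearby atoms
        contributions.append(per_atom_net(d))
    # Anything happening beyond `cutoff` (e.g. long-range electrostatics) is
    # invisible to every term of this sum, which is exactly the limitation.
    return torch.stack(contributions).sum()
```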

A work by Grisafi and Ceriotti in The Journal of Chemical Physics 2019 presents a possible way to circumvent this limitation, by blending physical terms together with an ML model.

As introduced above, predicting the properties of a large molecule or a bulk material typically involves computations summed up over the contributions from all atom-centered environments. This brings the downside explained above: such models cannot capture nonlocal, nonadditive effects. For example, effects arising from long-range electrostatics can involve distances much larger than those of the local spheres over which contributions are typically computed. The authors of this work tackled the problem with a framework that introduces nonlocal representations of the system, remapped as feature vectors defined locally and equivariantly in space, which can hence “connect” distant points of the molecule or material. The framework can in principle be adapted to any kind of long-range effect, as they demonstrate with various examples.

Globally, the work shows that combining representations sensitive to long-range correlations with the transferability of atom-centered additive ML models can lead to better predictions of chemical and physical phenomena. That’s the exact idea this article intended to foster: how physics-based and ML-based models, together, can advance science and engineering better than ever.

Besides the articles and blog entries I mentioned throughout the text…

