
“ML-Everything”? Balancing Quantity and Quality in Machine Learning Methods for Science | by LucianoSphere | Mar, 2023



The need for proper validation, for good datasets that are objective and balanced, and for predictions that are useful in realistic scenarios.

Photo by Tingey Injury Law Firm on Unsplash

Recent research in machine learning (ML) has led to significant progress in various fields, including scientific applications. However, there are limitations that need to be addressed to ensure the validity of new models, the quality of testing and validation procedures, and the actual applicability of the developed models to real-world problems. These limitations include evaluations that are unfair, subjective, or unbalanced (not necessarily intentionally, but present nonetheless), the use of datasets that don’t properly reflect real-world use cases (for example, datasets that are “too easy”), incorrect ways of splitting datasets into training, validation, and test subsets, and so on. In this article I discuss all these points, using examples from the domain of biology, which is being revolutionized by ML methodologies.

Along the way I will also briefly touch on the interpretability of ML models, which today is very limited but very important, because it could help clarify many of the limitations discussed in the first part of the article.

I’ll also stress that even if some ML models may be oversold, this doesn’t mean they aren’t useful, or that they haven’t generated new knowledge that can help advance certain subfields of ML.

Many new ML models come out every month, and this is just in my domain of work!

With the recent surge in the number of machine learning (ML) papers for scientific applications, I’ve started to ask myself: are all these works as revolutionary and useful as their authors propose? Sure, AlphaFold 2 was revolutionary, as were subsequent ML tools inspired by similar theoretical frameworks, for example those for protein design like ProteinMPNN. But more broadly, what’s going on in the field? How can scientists be producing so many ML tools, all of them supposedly “the best” at the same time? Is the research behind them good enough? And assuming the work is novel and good, are the evaluations always fair, objective, and balanced? Are the actual applications to real-world problems as revolutionary as sold?

Each time I see a new ML method for structural biology, I wonder how to assess its merit and especially its actual performance and hence its actual utility for my research.

Researchers are increasingly applying the latest developments in neural networks to long-standing problems in various fields, leading to significant progress. However, it’s crucial to ensure that evaluations are fair, objective, and balanced, and that datasets and prediction capabilities accurately reflect real-world applicability of the ML models.

Social network posts, preprint servers and peer-reviewed papers show a burst in the application of modern neural network modules and concepts (transformers, diffusion, etc.) to long-standing problems in science. The mere fact that researchers are doing this is in itself good, because the approach has led to remarkable progress in various fields. Take, as the two main examples, the advances in protein structure prediction led by AlphaFold 2 winning CASP14, and protein design, especially at the hands of D. Baker’s lab with ProteinMPNN, whose designed sequences were extensively tested in experimental work confirming that the method works. If you want to know more about these methods, check out my blog articles:

In many cases, the new methods are at least somewhat oversold. Take for example (and I think my suspicion is clearest here) the case of protein design. Most if not all recent works presenting modern ML models for protein design use as a success metric the sequence identity of the recovered sequence relative to the native sequence of the input protein structure. At first this might make sense, but on deeper thought, and knowing about protein sequences and structures, it is obvious that a good sequence match doesn’t necessarily mean (at all!) that a protein will fold as intended. For example, it is quite common for even single mutants of a protein to fail miserably to fold and crash out of solution, although they are almost identical to the wild-type protein and essentially perfect according to the “sequence identity” metric. Single mutations can even induce fold swaps, resulting in almost perfect sequence identity but no preservation of the structure. Conversely, it is not at all uncommon to find proteins that fold in pretty much the same way even though their sequences are entirely different, hence of very low sequence identity. In conclusion, sequence recovery is reasonable but *very limited* as a metric for actual success in protein design.

Sequence recovery is reasonable but *very limited* as a metric for actual success in protein design
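To make the metric concrete, here is a minimal sketch of how sequence recovery is typically computed: the fraction of positions at which the designed sequence matches the native sequence of the input structure. It assumes the two sequences are already aligned one-to-one; the function name and the toy sequences are just illustrative.

```python
def sequence_recovery(native_seq: str, designed_seq: str) -> float:
    """Fraction of positions at which the designed sequence matches the
    native sequence of the input structure (assumes the sequences are
    aligned one-to-one, i.e. equal length)."""
    if len(native_seq) != len(designed_seq):
        raise ValueError("sequences must be aligned and of equal length")
    matches = sum(a == b for a, b in zip(native_seq, designed_seq))
    return matches / len(native_seq)


# Toy example: a single mutation still gives ~97% recovery, yet in a real
# protein that one change could be enough to destabilize the fold.
native   = "MKVLINGKTLKGEITVEGAKNAALPILAA"
designed = "MKVLINGKTLKGEITVEGAKNAALPILAV"
print(f"sequence recovery: {sequence_recovery(native, designed):.2f}")
```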

For the moment, the only truly meaningful test in protein design is getting your hands dirty in the wet lab: producing the designed protein and determining its structure experimentally to check that it matches the input structure, at least reasonably well, as of course we can’t expect a perfect match. While the ProteinMPNN papers and a handful of other works do carry out this experimental validation, most preprints and papers I see out there simply overlook it, focusing exclusively on sequence recovery and similar metrics. And no, finding that AlphaFold can back-predict the structure fed into the design protocol isn’t a safe indication that the design works either! At best, it can be used to identify bad sequences.
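That said, back-prediction can serve as a cheap negative filter before committing wet-lab resources. Here is a minimal sketch of how such a self-consistency check might look, assuming you already have Cα coordinates (with a one-to-one residue correspondence) for the design target and for the structure predicted from the designed sequence; the function names and the 2 Å cutoff are illustrative choices, not a standard.

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """Cα RMSD between two (N, 3) coordinate arrays after optimal
    superposition (Kabsch algorithm). Assumes a one-to-one residue
    correspondence between the two structures."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # optimal rotation mapping P onto Q
    return float(np.sqrt(((P @ R.T - Q) ** 2).sum() / len(P)))

def looks_like_a_bad_design(target_ca: np.ndarray,
                            backpredicted_ca: np.ndarray,
                            rmsd_cutoff: float = 2.0) -> bool:
    """Flag a design whose back-predicted structure deviates strongly from
    the design target. Passing this filter does NOT mean the protein will
    fold as intended; failing it is merely a cheap reason to discard the
    candidate before spending wet-lab effort. The cutoff is arbitrary."""
    return kabsch_rmsd(target_ca, backpredicted_ca) > rmsd_cutoff
```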

If you want to know more about ProteinMPNN, which I just mentioned:

An important problem, which I describe here in general terms because I don’t want to talk negatively about any specific work, is that many studies assess their models on datasets that are not good enough for the task. The two main problems I’ve seen are datasets that fail to reflect real-world applications of the ML model, and datasets that contain entries overlapping with the training data.
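The second problem, overlap between test and training entries, can at least be screened for. Real pipelines typically cluster sequences with tools like MMseqs2 or CD-HIT at some identity threshold before splitting; the sketch below is only a toy, alignment-free stand-in based on shared k-mers, meant to illustrate the idea of flagging test entries that are suspiciously similar to the training set.

```python
def kmer_set(seq: str, k: int = 3) -> set:
    """All overlapping k-mers of a sequence (a cheap, alignment-free fingerprint)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def too_similar(seq_a: str, seq_b: str, threshold: float = 0.5) -> bool:
    """Flag a pair whose k-mer Jaccard similarity exceeds a threshold.
    Real pipelines use proper clustering tools (e.g. MMseqs2 or CD-HIT at
    ~30% sequence identity); this is only a toy stand-in for the idea."""
    a, b = kmer_set(seq_a), kmer_set(seq_b)
    return len(a & b) / len(a | b) > threshold

def leaking_test_entries(train_seqs, test_seqs, threshold: float = 0.5):
    """Return the test sequences that look suspiciously close to the training set."""
    return [t for t in test_seqs
            if any(too_similar(t, tr, threshold) for tr in train_seqs)]
```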

I’m not saying there’s bad intention. Training ML models requires very large datasets, so large that they are impossible to curate manually, and automated curation pipelines always have limitations. On the other hand, preprints and papers tend to show only those cases that illustrate the positive applicability of the ML model to some biological problem, hiding or overlooking cases that make no biological sense, that are hard to interpret, or (and this is just plain bad scientific practice) that don’t match what’s already known.

None of these problems is specific to the domain; rather, they are particular manifestations of well-established problems of science in general: only positive results are deemed to contribute, hence negative and incorrect results do not abound in the literature, although they are as relevant as positive results, especially to avoid futile waste of resources and time. The publish-or-perish culture promotes the publication of mainly positive results, often dressed up with overstated claims of novelty and supremacy. To know more about my thoughts on the problems of science, especially regarding publishing, see this:

Given what I exposed above, it is in my opinion very likely that contests like CASP (or CAMEO, or CAPRI for structure prediction, etc.) and studies dedicated to objectively benchmarking existing methods contribute to advancing the field much more than the vast majority of papers reporting new models. In fact, I hold this view so strongly that, to me, the results of a contest like CASP or of an independent benchmarking study essentially overrule any paper describing the (presumably) cool results of a new method that didn’t perform well in that evaluation (although this doesn’t necessarily mean the method can’t contain ideas of potential future interest).

Contests like CASP (or CAMEO, or CAPRI for structure prediction, etc.) and studies dedicated to objectively benchmarking existing methods contribute to advancing the field much more than the vast majority of papers reporting new models.

A point to note regarding evaluations, specific to protein tertiary structure prediction post-AlphaFold 2, is that now only small improvements are possible, which complicates the comparison of different new methods (a problem that AlphaFold 2 itself of course didn’t face, because the bar was not high at that moment). This is less of a problem in fields where predictions are still poor to intermediate, like drug design, docking, prediction of conformational dynamics, and other open questions, now that protein tertiary structure prediction is more or less solved:
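When the margins are that small, averages over a benchmark can be misleading, and paired, per-target statistics become essential. Below is a minimal sketch of a paired bootstrap comparison of two methods scored on the same targets; the scores are invented, purely to illustrate the idea.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot: int = 10_000, seed: int = 0):
    """Bootstrap confidence interval for the mean per-target difference between
    two methods scored on the same targets (e.g. per-target GDT_TS or lDDT).
    If the interval includes 0, the apparent improvement of A over B is not
    distinguishable from target-sampling noise."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = len(diffs)
    boot_means = [diffs[rng.integers(0, n, n)].mean() for _ in range(n_boot)]
    low, high = np.percentile(boot_means, [2.5, 97.5])
    return diffs.mean(), (low, high)

# Invented per-target scores for two hypothetical methods on the same targets:
a = [78.2, 91.0, 65.4, 88.3, 72.1, 69.9, 84.0]
b = [77.5, 90.2, 66.0, 87.9, 70.8, 70.3, 83.1]
mean_diff, (low, high) = paired_bootstrap_ci(a, b)
print(f"mean improvement: {mean_diff:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
```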

As scientists explore alternative ways to tackle a problem—via ML as is the focus here—they may come up with solutions that might seem revolutionary and have great prospects, but ultimately fall short. This doesn’t necessarily mean that the new developments aren’t useful for certain applications, or that they haven’t generated new knowledge that is useful for future work and to help advance certain fields.

One example that represents this very well is the application of protein language models to predictions in structural biology. One of the very first methods reported to do this showed substantially lower accuracy than AlphaFold 2, but with execution times orders of magnitude faster, which could be beneficial for certain applications:

Probably the most popular (and to me most useful) of these language-based models for protein structure prediction, Meta’s ESMFold, came out soon after. Meta’s release of this new neural network, together with a huge database of protein structure models precomputed with it, sparked a short-lived hype that I think nonetheless represented a smaller yet substantial contribution to the revolution started by DeepMind’s AlphaFold 2 in the field:

However, although ESMFold is very fast and performs much better than previous protein language models for structure prediction, its predictions lag a bit behind those produced by AlphaFold 2 (besides, ESMFold is limited in other ways, for example regarding the use of custom templates or the prediction of structures of protein complexes). You can check for yourself the relatively poor performance of ESMFold (by today’s standards) in the official evaluation carried out during the 15th edition of CASP:
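If you want to try ESMFold on your own sequences, here is a minimal sketch of local usage through the open-sourced esm package, roughly as documented in its repository; treat the install command and exact calls as indicative rather than definitive, since the API may change, and the sequence below is just a placeholder.

```python
import torch
import esm  # pip install "fair-esm[esmfold]"

# Load ESMFold (a large download on first use); a GPU is strongly recommended.
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Placeholder sequence; substitute your own protein of interest.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKALPDAQFEVVHSLAKWKR"

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # prediction returned as a PDB-format string

with open("prediction.pdb", "w") as fh:
    fh.write(pdb_string)
```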

But don’t get me wrong. Meta’s model is very useful and has great potential, having in practice paved new roads towards the development of multiple new tools that use its protein language model. And I think the model is especially interesting from the viewpoint of the network’s internal mechanics for structure prediction, which could certainly evolve further in the future. For example, this joint work from the Baker lab and Meta showed that language models, though trained only on sequences, learn deeply enough to design protein structures that go beyond natural proteins, even including motifs that are not observed in similar structural contexts in known proteins (and tested experimentally!):

One of the limitations of ML, not only for structural biology but I’d say for most applications in science or engineering, is the lack of interpretability, or at least of explicit interpretability. In other words, ML models work largely as black boxes: they are pragmatically useful for the tasks or predictions they were designed to execute, but they provide few (if any) insights into how and why they perform so well (or poorly, as the case may be).

Even for models of presumably high accuracy and reliability, it is desirable to understand why and when a model performs well or poorly. Moreover, one would ideally want these explanations to be rooted in the fundamental science of the domain, say in terms of the underlying physics or chemistry, connecting the relevant variables in somewhat explicit ways, just like traditional modeling uses equations that we humans can understand.

In structural biology in particular, we have the issue that ML models for protein structure prediction work very well, but little is known about how they achieve such good predictions. Thus, it is not very clear whether they’ve learned something about protein structure that we don’t know, or (I think more likely) something we do know about but in ways that are too hard to quantify, hence hard to apply to structure prediction in “analytic” ways. Furthermore, we don’t even know whether these ML methods for structure prediction are only excellent predictors of folded states, or whether they can also predict intermediates along folding pathways, alternative conformational states, the structural propensities of intrinsically disordered regions, etc. My guess is that they can’t, or at least not with high confidence, because by design they are biased towards predicting the structures of well-folded states. An explicit explanation of how they achieve such good predictions for such states, and of what kind of information flows through the system, could perhaps prove me wrong, and would certainly help method developers understand the limitations and try to overcome them with improved models.
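Lacking such explicit explanations, about the best one can do today is probe the model from the outside. Below is a minimal sketch of one crude black-box probe: substitute one position at a time and watch how the model’s global confidence responds. The predict_confidence callable is hypothetical, standing in for whatever confidence score a given predictor exposes, and with any real predictor this loop would be computationally expensive.

```python
def sensitivity_profile(sequence: str, predict_confidence, substitute: str = "A"):
    """Crude black-box probe: substitute each position (here with alanine),
    re-run the predictor, and record how much the global predicted confidence
    drops relative to the original sequence. `predict_confidence` is a
    hypothetical callable (sequence -> float, e.g. a mean pLDDT)."""
    baseline = predict_confidence(sequence)
    profile = []
    for i, aa in enumerate(sequence):
        if aa == substitute:
            profile.append(0.0)  # skip trivial self-substitutions
            continue
        mutated = sequence[:i] + substitute + sequence[i + 1:]
        profile.append(baseline - predict_confidence(mutated))
    return profile
```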

The lack of interpretability in ML models can be problematic in several ways. First, it can be difficult to diagnose and correct errors when a model is not performing as expected. This is of special concern when dealing with far-fetched extrapolations, say for example predicting the structure of a protein that is too different from all proteins of known structure. Without an understanding of how the model arrives at its predictions, it can be difficult to know how to fix it and to assess the reliability of each prediction, although modern ML tools are increasingly incorporating metrics for prediction reliability.
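As a concrete example of such reliability metrics, both AlphaFold 2 and ESMFold write their per-residue confidence (pLDDT, on a 0 to 100 scale) into the B-factor column of the output PDB file, which makes it easy to flag regions that should not be over-interpreted. A minimal sketch, using only the standard PDB fixed-column format; the function names are mine, and the cutoff of 70 is a common rule of thumb rather than a hard rule.

```python
def per_residue_plddt(pdb_path: str) -> dict:
    """Read per-residue confidence (pLDDT) from a predicted PDB file.
    AlphaFold 2 and ESMFold both store pLDDT in the B-factor column;
    here we take the value on each residue's CA atom."""
    plddt = {}
    with open(pdb_path) as fh:
        for line in fh:
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                res_id = (line[21], int(line[22:26]))  # (chain ID, residue number)
                plddt[res_id] = float(line[60:66])     # B-factor column
    return plddt

def low_confidence_residues(pdb_path: str, cutoff: float = 70.0) -> list:
    """Residues below a pLDDT cutoff, to be interpreted with extra caution."""
    return [res for res, score in per_residue_plddt(pdb_path).items()
            if score < cutoff]
```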

Second, a lack of interpretability limits our ability to gain insight into the underlying physics, chemistry or biology that explains why a given system behaves in some way, even if we can correctly predict this behavior. Not a big deal for pragmatic applications, but certainly incomplete regarding the fundamental understanding that science seeks.

Finally, lack of interpretability can limit our ability to build trust in ML models. If we cannot understand how a model arrived at its prediction, we may be reluctant to use it in situations where accuracy is critical. In structural biology in particular, inaccurate models can lead to incorrect conclusions about the function of biological molecules and hinder the progress of all the associated studies and developments.

The point of interpretability in the context of this article is that more interpretable ML models could alleviate many of the problems associated with their building, training, and application to real-world problems, and possibly even reveal potential problems before they show up in application, thus improving quality in its balance with quantity.

More interpretable ML models could alleviate many of the problems associated with their building, training, and application to real-world problems, and thus improve quality in its balance with quantity.

There are people working on the problem of the interpretability of ML models, including some working specifically in the context of scientific applications. I will soon write up a blog article about this here.

I read great posts by other bloggers and scientists while putting my article together; among them, I especially recommend these, although we do not always agree:

