
Towards Auto-Generated Models
by Haifeng Jin | Nov 2022



A brief history of Neural Architecture Search and beyond

Photo by Didssph on Unsplash

An interesting paper was quietly released on arXiv in 2016 by a group of research scientists from Google Brain. It used machine learning (ML) to create new ML models. No one expected that it would raise every possible controversy about ML research and put Google on the spot over the next three years. It then took another three years for a Turing Award winner at Google to fully analyze the issue and publish a formal response.

As a Ph.D. who wrote my dissertation on this topic, I will walk you through the interesting ideas and dramas that unfolded in this field.

Breakthrough: The initial paper

Reinforcement learning can teach a smart agent to perform many tasks, like playing chess or arcade games. The agent can learn almost anything, as long as there is a clear feedback signal to evaluate its performance. What about designing ML models? A machine learning model is easy to evaluate: just train it and measure its accuracy. That was the idea of the paper.

The paper, titled “Neural Architecture Search with Reinforcement Learning,” was published in 2016. Instead of manually designing a new type of neural network architecture, like AlexNet, ResNet, or DenseNet, it uses reinforcement learning to automatically generate new neural architectures. As shown in the following figure, an agent, which is itself a neural network, is trained to produce neural architectures. Each generated architecture is trained and evaluated, and the evaluation result serves as the reward signal for the agent. By producing new architectures and observing the performance of each one, the agent gradually learns how to produce good neural architectures.

Image by the Author
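To make the loop concrete, here is a toy, runnable sketch of the search procedure in the spirit of the paper. The search space, the REINFORCE-style controller, and the reward function are drastically simplified placeholders of my own, not the paper's actual design; in particular, the reward here is a synthetic stand-in for the validation accuracy you would get by actually training each candidate network.

```python
# A toy sketch of an RL-based NAS loop (not the paper's actual code).
# The "architecture" is just a list of layer widths, and the reward is a
# synthetic stand-in for validation accuracy after training the candidate.
import numpy as np

CHOICES = [16, 32, 64, 128]        # candidate width for each of 3 layers
rng = np.random.default_rng(0)

def reward_of(arch):
    # Hypothetical proxy for "train the model and return validation accuracy".
    return sum(np.log2(w) for w in arch) / 21.0

# The controller keeps one categorical distribution (as logits) per layer.
logits = np.zeros((3, len(CHOICES)))

def sample_arch():
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    idx = [rng.choice(len(CHOICES), p=p) for p in probs]
    return idx, [CHOICES[i] for i in idx]

baseline = 0.0
for step in range(200):
    idx, arch = sample_arch()
    r = reward_of(arch)
    baseline = 0.9 * baseline + 0.1 * r            # moving-average baseline
    for layer, choice in enumerate(idx):           # REINFORCE update
        probs = np.exp(logits[layer]) / np.exp(logits[layer]).sum()
        grad = -probs
        grad[choice] += 1.0                        # d log p(choice) / d logits
        logits[layer] += 0.1 * (r - baseline) * grad

print("architecture favored by the controller:",
      [CHOICES[i] for i in logits.argmax(axis=1)])
```

The controller gradually shifts probability mass toward choices that earn higher rewards, which is the essence of the feedback loop in the figure above.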

In the paper, a new type of neural network, NASNet, was auto-generated by reinforcement learning and outperformed most existing neural architectures of the time on image classification tasks.

The paper also opened up a whole new research area in ML, named Neural Architecture Search (NAS), which uses automated methods to produce new types of neural architectures.

The paper was game-changing. On the surface, it was just another attempt at creating better neural architectures. In reality, it attempted to automate the work of an ML researcher. Many people fear that their jobs will be taken by ML. Well, ML researchers don't. On the contrary, they have been working hard to replace themselves with ML. If this idea works, such systems could start to produce ML models that are beyond the reach of human researchers.

Such an ambitious attempt came at a price: Google Brain needed 800 GPUs to run the experiments, because a large number of neural architectures had to be trained and evaluated. Such a high cost was intimidating for most players in the field, and no other group immediately published follow-up work on the topic.

Google Brain continued this line of research by extending the automated-design idea to other parts of the ML process, such as searching for activation functions and data augmentation policies.

Breakthrough: Cost reduction

To reduce the computing resources consumed by NAS, Google Brain published a new paper, “Efficient Neural Architecture Search via Parameter Sharing” (ENAS), about a year after the initial one. It reduced the cost of NAS to roughly the cost of training a single neural network. How did they do it?

The neural architectures generated by the agent may not differ much from one another. In the previous work, every neural architecture had to be trained end-to-end, even when it was very similar to others already evaluated. Can we reuse the weights of previously trained architectures to speed up the training of the architectures generated later? That is the idea of this paper.

They still use an agent to generate new neural architectures; the efficiency comes from changing how each architecture is trained. Every candidate architecture is warm-started with weights shared with previously trained candidates. With this improvement, the total cost of training all the architectures generated by the agent is roughly the cost of training one big neural network.
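Here is a toy sketch of the weight-sharing idea, assuming a drastically simplified setting of my own (two positions, three candidate operations, and a fake training step). It is not ENAS's actual implementation, only an illustration of how every candidate draws from, and updates, one shared pool of weights.

```python
# A toy sketch of weight sharing (not ENAS's actual implementation):
# all candidate architectures draw their layer weights from one shared pool,
# so a new candidate starts from whatever training earlier candidates did.
import numpy as np

rng = np.random.default_rng(0)
OPS = ["conv3x3", "conv5x5", "identity"]

# One shared weight tensor per (position, operation); shapes are arbitrary here.
shared = {(pos, op): rng.normal(size=(8, 8)) for pos in range(2) for op in OPS}

def sample_architecture():
    return [rng.choice(OPS) for _ in range(2)]      # pick one op per position

def train_candidate(arch, steps=10):
    """Pretend-train the candidate: only the shared weights it uses are updated."""
    for _ in range(steps):
        for pos, op in enumerate(arch):
            shared[(pos, op)] -= 0.01 * shared[(pos, op)]  # stand-in for a gradient step

for _ in range(5):
    arch = sample_architecture()
    train_candidate(arch)   # warm-started: reuses weights touched by earlier candidates
    # ...evaluate the candidate and reward the agent, as in the previous sketch...
```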

Breakthrough: Neural architectures are differentiable

Soon after the paper above, other research groups started to join the game. They were also aiming to make NAS more efficient, but from a different angle: instead of improving the training and evaluation, they improved the agent.

From discrete mathematics, we know that graphs are discrete structures. Neural architectures are computation graphs, so they are discrete. Because of this discreteness, the agent may have trouble perceiving the correlations between different neural architectures and their performance. What if we could make the neural architectures differentiable? Then we would no longer need an agent to generate them; we could directly optimize a neural architecture with gradient descent, just as we train a neural network.

It sounds like a crazy idea, but the authors of “DARTS: Differentiable Architecture Search” and “Neural Architecture Optimization” found ways to do it and implemented them. We will not go into the details here, but you are welcome to read the papers yourself.
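The core trick, in the spirit of DARTS, is a continuous relaxation: each edge of the graph outputs a softmax-weighted sum of all candidate operations, the mixing weights become continuous architecture parameters trained by gradient descent, and the final architecture keeps the highest-weighted operation on each edge. Below is a toy, runnable sketch of this relaxation on a single edge with three made-up candidate operations; it illustrates the idea, not the papers' actual algorithms.

```python
# A toy sketch of the continuous relaxation behind differentiable NAS.
# One edge mixes three candidate ops; the mixing weights (alpha) are trained
# by gradient descent, and the final choice is the op with the largest weight.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
target = 2.0 * x                      # the "right" op here is doubling the input

ops = [lambda v: v, lambda v: 2.0 * v, lambda v: np.zeros_like(v)]  # identity, double, zero
alpha = np.zeros(3)                   # architecture parameters, one per candidate op

for step in range(200):
    w = np.exp(alpha) / np.exp(alpha).sum()            # softmax over candidate ops
    out = sum(wi * op(x) for wi, op in zip(w, ops))    # mixed (differentiable) edge
    err = out - target
    # d(mean squared error) / d(alpha_i) via the softmax Jacobian
    grads = np.array([2 * np.mean(err * (op(x) - out)) * w[i]
                      for i, op in enumerate(ops)])
    alpha -= 0.5 * grads

print("selected operation index:", int(alpha.argmax()))   # expect 1 (the doubling op)
```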

Other attempts

As more researchers contributed to this topic, it became harder to make breakthroughs on the original NAS problem. They started to build on these differentiable solutions or to add constraints to the problem. Among these papers, some high-quality ones are worth reading, including SNAS, ProxylessNAS, and MnasNet.

Controversy: NAS doesn’t make sense

Just as we all thought this topic had reached its plateau and nothing big would happen, a huge controversy arose that threatened to destroy the entire research area.

Imagine this situation: if I only allow the agent to generate neural architectures very similar to ResNet, the generated architectures are guaranteed to perform similarly to ResNet. In this case, can we say the agent is smart enough to find good neural architectures? No matter how dumb the agent is, as long as the researcher is smart enough to constrain the search space to a small pool of good neural architectures, the results will be good. Did the authors of the NAS papers rely on this trick? A paper was published just to test it.

The paper is titled “Random Search and Reproducibility for Neural Architecture Search”. Its authors replaced the reinforcement learning (or whatever search algorithm a given NAS paper used) with plain random search.
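The baseline itself is trivial, which is exactly the point. A minimal sketch is below, assuming hypothetical `search_space.sample()` and `train_and_evaluate()` helpers; no agent or learned controller is involved.

```python
# A minimal sketch of the random-search baseline: sample architectures uniformly
# from the same search space, evaluate each one, and keep the best.
# `search_space.sample()` and `train_and_evaluate()` are hypothetical helpers.
def random_search(search_space, num_trials):
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = search_space.sample()          # uniform sampling, no learned agent
        score = train_and_evaluate(arch)      # e.g., validation accuracy
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch
```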

The results were shocking: random search performed as well as the other search algorithms, and sometimes even better. This conclusion seemed to invalidate the entire research field of NAS.

If the paper's conclusion is true, NAS is just a fancier way to manually design neural architectures. It can never become self-sufficient at designing ML models, because it cannot find good neural architectures without a search space carefully crafted by an experienced ML researcher.

Breakthrough: Discover models from scratch

Just as we all thought NAS research was over, it made a remarkable comeback with another game-changing paper, “AutoML-Zero: Evolving Machine Learning Algorithms From Scratch”, which shed new light on whether ML can do the work of generating new ML models.

Unlike previous NAS papers, it no longer restricts the search to a carefully human-designed space of neural architectures. In fact, it does not search over neural architectures at all: it builds ML models from scratch.

It uses an evolutionary algorithm to combine basic mathematical operations, like addition, multiplication, and uniform sampling, into small programs. These programs serve as ML models, and their performance on datasets determines which ones survive. With this setup, almost any ML model is contained in the search space.
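To give a feel for this style of search, here is a toy, runnable sketch of evolving programs built from basic math operations, loosely in the spirit of AutoML-Zero but vastly simplified and entirely my own construction: a "program" is a short list of instructions over a tiny register file, and its fitness is how well it predicts y = 2x + 1.

```python
# A toy sketch of evolving programs from basic math operations (not the
# paper's actual system). Uses regularized (aging) evolution: mutate a good
# parent, add the child, and drop the oldest member of the population.
import random
random.seed(0)

OPS = ["add", "mul", "sub"]
NUM_REGS = 4          # r0 holds the input x; r3 is read out as the prediction

def random_instr():
    return (random.choice(OPS), random.randrange(NUM_REGS),
            random.randrange(NUM_REGS), random.randrange(NUM_REGS))

def run(program, x):
    regs = [x, 1.0, 0.5, 0.0]                 # r1, r2 are constants the program can use
    for op, a, b, out in program:
        if op == "add":   regs[out] = regs[a] + regs[b]
        elif op == "mul": regs[out] = regs[a] * regs[b]
        else:             regs[out] = regs[a] - regs[b]
    return regs[3]

def fitness(program):
    xs = [i / 10.0 for i in range(-20, 21)]
    return -sum((run(program, x) - (2 * x + 1)) ** 2 for x in xs)   # negative error

population = [[random_instr() for _ in range(5)] for _ in range(50)]
for _ in range(3000):
    parent = max(random.sample(population, 10), key=fitness)        # tournament selection
    child = [instr if random.random() > 0.2 else random_instr() for instr in parent]
    population.append(child)
    population.pop(0)                                               # age out the oldest

best = max(population, key=fitness)
print("best fitness:", fitness(best))
```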

The results are amazing. Many widely used ML models were generated by the algorithm: for example, it rediscovered linear regression, stochastic gradient descent, ReLU, and two-layer neural networks. The search process was like replaying the history of ML research.

Although the paper did not come up with better ML models, it proved the concept of auto-generated ML models. It is possible, at least on a basic level, for ML to generate usable ML models.

Controversy: NAS vs the environment

Soon after the previous controversy, an even bigger crisis arose around NAS, starting with a paper named “Energy and Policy Considerations for Deep Learning in NLP”. It was highlighted by MIT Technology Review and drew a lot of attention.

The work investigated the energy consumption and carbon emissions of deep learning research, especially the training of large deep learning models for natural language processing (NLP). The biggest model mentioned in the paper was a Transformer model found with NAS, which contains 0.2 billion parameters. Training such a model would emit roughly as much CO2 as taking 300 passengers on a round trip between San Francisco and New York.

Deep learning versus the environment: this has become a fundamental question as deep learning has advanced at a breakneck pace in recent years. Suddenly, a lot of criticism arose, and Google was blamed for harming the environment by training large deep learning models. Would Google successfully handle this PR crisis? More importantly, would Google take responsibility for making deep learning eco-friendly?

In response, Google committed top talent to the issue. David Patterson, a Turing Award winner, led a group of research scientists in a thorough investigation of the problem for almost three years. Finally, they published a confident response in 2022.

The paper is titled “The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink”. It proposes four best practices to reduce the carbon footprint and energy consumption of training deep learning models. By adopting these best practices, Google dramatically reduced the carbon emissions of its deep learning workloads. The CO2 reduction in 2021 alone equaled the CO2 emitted by training a Transformer model 700 times in 2017.

Neural architecture search is not AutoML

Before the conclusion, I would like to clear up the confusion between NAS and AutoML, since the two terms often appear together. AutoML even appears in the name “AutoML-Zero”. However, most NAS research is not AutoML.

People refer to NAS as AutoML because NAS is an automated way to create new ML models, and the models it produces generalize well to other datasets and tasks. However, from a practitioner's perspective, those models are not so different from ResNet, apart from better performance: practitioners still have to handle data preprocessing, post-processing, hyperparameter tuning, and so on to use them in practice.

AutoML is not about producing new ML models; it is about making ML solutions easy to adopt. Given a dataset and a task, AutoML tries to automatically assemble a suitable end-to-end ML solution, including data preprocessing, post-processing, hyperparameter tuning, and everything in between. From a practitioner's perspective, it dramatically reduces the knowledge and engineering workload required to apply ML to their problems.
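As an illustration of the practitioner's view, here is a minimal sketch of an AutoML-style search that jointly picks the preprocessing step, the model family, and the hyperparameters for one dataset. It uses scikit-learn's grid search as a simple stand-in; real AutoML systems search far larger spaces with smarter strategies, and this example is my own, not one from the article.

```python
# A minimal sketch of AutoML from a practitioner's perspective: given only a
# dataset and a task, search jointly over preprocessing, model choice, and
# hyperparameters to assemble an end-to-end pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])

# Each entry is one family of candidate pipelines: preprocessing + model + hyperparameters.
search_space = [
    {"scale": [StandardScaler(), MinMaxScaler()],
     "model": [LogisticRegression(max_iter=1000)],
     "model__C": [0.1, 1.0, 10.0]},
    {"scale": [StandardScaler(), MinMaxScaler()],
     "model": [RandomForestClassifier()],
     "model__n_estimators": [50, 200]},
]

search = GridSearchCV(pipe, search_space, cv=5)
search.fit(X, y)
print("best pipeline:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```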

Conclusions

So far we have reviewed the history of auto-generated ML models. Now we need to answer the question: Can we rely on auto-generated ML models?

From the discussion above, the solutions did get better and better as the field progressed. When AutoML-Zero came out in particular, I felt the field was on the right track. However, it already exhausts today's computing power just to rediscover a simple two-layer neural network; I cannot imagine how much computing it would take to produce models better than today's state of the art. Also, the financial benefits are not as clear as for other ML research, such as GPT-3 or Stable Diffusion, so big companies are unlikely to devote huge resources to it. Therefore, my answer is: it is not achievable any time soon.

The greater implication

Beyond the conclusions above, these efforts carry greater implications for ML research in general that cannot be ignored. The papers we discussed are remarkably brute-force, and quite different from typical ML research.

What exactly is the difference? We used to build ML models with elegant math equations and rigorously proven theorems. The NAS algorithms, however, do not understand math at all: they treat ML models as black boxes and evaluate them solely through experiments.

The methodology of ML research is starting to look like that of the experimental sciences: just as in experimental physics, we make new findings through experiments. So, should we conduct ML research like mathematics or like an experimental science? Is the math behind ML models necessary? NAS raised these questions for all of us to answer, and the answers will have a profound influence on future ML research.

In my opinion, creating ML models without strong mathematical backing may be inevitable in the future. Math was a perfect world built by humans, where every conclusion was reached through rigorous derivation. However, as that world evolves, it is gradually becoming as complex as the real one. In the end, we may have to rely on experimental methods to make new discoveries in math, just as we do with the real world today.

