
Can Reinforcement Learning Generalize Beyond Its Training? | by John Morrow | Jan, 2023



Photo by János Szüdi on Unsplash

The project detailed in the paper, Reinforcement Learning: A Case Study in Model Generalization, explores the ability of a model trained with reinforcement learning (RL) to generalize, i.e., to produce acceptable results when presented with data it was not exposed to during training. The application in this study is an industrial process with multiple controls whose settings determine how the product is affected as it moves through the process. Finding optimal control settings in this environment can be challenging. For example, when the controls interact, adjusting one setting can require readjusting others. A complex relationship between a control and its effect further complicates finding an optimal solution. The results presented here show that a model trained with RL performs well in this environment and can generalize to conditions different from those used for training.

The paper describes an RL model trained to find the optimal control settings for a reflow oven used for soldering electronic components to a circuit board (Figure 1). The oven’s moving belt transports the product (i.e., the circuit board) through multiple heating zones. This process heats the product according to a temperature-time target profile required to produce reliable solder connections.

Figure 1: Circuit boards on oven belt (Image via Adobe under license to John Morrow)

A human operator typically takes the following steps to determine the heater settings required to solder circuit boards successfully:

• run one pass of the product through the oven

• observe the resulting temperature-time profile from the sensor readings

• adjust the heater settings to improve the profile toward the target profile

• wait for the oven temperature to stabilize to the new settings

• repeat this procedure until the profile from the sensor readings is acceptably close to the target profile

Learning the policy

An RL system replaces the operator steps with a two-stage process. In the first stage, an agent learns the dynamics of the oven and creates a policy for updating the heater settings under various oven conditions.

Since considerable time is required for the oven temperature to stabilize after the heater settings change, plus further time to pass the product through the oven, an oven simulator is used to speed up the learning process. The simulator emulates a single pass of the product through the heating zones in a few seconds instead of the many minutes required by a physical oven.

In each pass of the learning stage, the agent takes an action from its current state by sending the simulator new settings for the eight heaters. After the simulation run, the simulator reports back the product temperature readings (three hundred readings taken at 1-second intervals).
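To make this interaction concrete, the sketch below shows a minimal interface such a simulator might expose: one call accepts the eight heater settings and returns the 300 product temperature readings for a full pass. The class name and the placeholder heat-transfer model are illustrative assumptions, not the paper's simulator.

```python
import numpy as np

class OvenSimulator:
    """Minimal sketch of the simulator interface described above (names and
    placeholder physics are assumptions, not the paper's implementation)."""
    N_HEATERS = 8      # eight heater zones
    N_READINGS = 300   # one product temperature reading per second

    def __init__(self, target_profile: np.ndarray):
        assert target_profile.shape == (self.N_READINGS,)
        self.target_profile = target_profile

    def run_pass(self, heater_settings: np.ndarray) -> np.ndarray:
        """Emulate one pass of the product through the oven and return the
        300 product temperature readings."""
        assert heater_settings.shape == (self.N_HEATERS,)
        # Stand-in physics: interpolate the eight zone settings across the
        # 300 time steps and add sensor noise, just so the interface runs.
        zone_centers = np.linspace(0, self.N_READINGS - 1, self.N_HEATERS)
        readings = np.interp(np.arange(self.N_READINGS), zone_centers, heater_settings)
        return readings + np.random.normal(0.0, 1.0, self.N_READINGS)
```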

The agent is rewarded for its action based on the difference between the returned readings and the target temperature-time profile. If the difference for the current run is less than that of the previous run, the reward is positive; otherwise, it is negative. A subset of the readings determines the new state of the system. The agent starts the next pass of the learning stage by taking an action from this new state.
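A minimal sketch of such a difference-based reward is below; the error metric (sum of absolute deviations) and the ±1 reward values are assumptions for illustration, not the paper's exact function.

```python
import numpy as np

def reward_from_readings(readings, target_profile, prev_error):
    """Return a reward and the current error (assumed form: positive reward
    when this run deviates less from the target profile than the last run)."""
    error = float(np.abs(np.asarray(readings) - np.asarray(target_profile)).sum())
    reward = 1.0 if error < prev_error else -1.0
    return reward, error
```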

Planning with the policy

In the second stage, the agent follows the learned policy to find optimal heater settings. These settings will produce the closest match between the actual product profile and the target temperature-time profile. Figure 2 shows the result of the agent following the policy to find optimal settings. The blue trace is the target temperature-time profile, and the red trace is the actual profile produced by the optimal settings.

Figure 2: Example planning result. Blue trace: target profile. Red trace: actual product profile.
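One way such a planning loop could look is sketched below: repeatedly observe the state, take the action the learned value network scores highest, and keep the heater settings that gave the smallest profile error. The environment methods observe and apply_action, and the discrete action set they imply, are hypothetical names for illustration.

```python
import numpy as np

def plan_with_policy(env, q_values, initial_settings, n_steps=50):
    """Greedy rollout of a learned policy (a sketch under assumed interfaces)."""
    settings = np.array(initial_settings, dtype=float)
    best_settings, best_error = settings.copy(), float("inf")
    for _ in range(n_steps):
        state, error = env.observe(settings)        # normalized state and profile error
        if error < best_error:                      # remember the best settings seen so far
            best_settings, best_error = settings.copy(), error
        action = int(np.argmax(q_values(state)))    # greedy action under the learned policy
        settings = env.apply_action(settings, action)  # e.g., nudge one heater up or down
    return best_settings, best_error
```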

Reinforcement learning system

As discussed above, an RL system comprises an agent taking actions in an environment to learn a policy for reaching a target goal. The environment responds to each action with a reward indicating whether the action moved the agent toward or away from the goal, and it returns the agent's new state. The agent consists of two neural networks: the model network and the target network. The agent's goal is to find heater settings that produce a product temperature-time profile very close to the target profile. The environment is the reflow oven simulator. Figure 3 shows the components of the RL system, each of which is described in detail in the paper.

Figure 3: Reinforcement learning system
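The model/target pair is characteristic of a DQN-style value learner, in which the model network is trained toward targets computed by a slower-moving copy of itself. Below is a minimal sketch of that arrangement; the layer sizes, learning rate, and discount factor are illustrative choices, not values from the paper.

```python
import copy
import torch
import torch.nn as nn

state_dim, n_actions = 8, 16   # assumed: one state value per heater zone, discrete actions

model_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                          nn.Linear(64, n_actions))
target_net = copy.deepcopy(model_net)   # frozen copy, refreshed every N updates
optimizer = torch.optim.Adam(model_net.parameters(), lr=1e-3)
gamma = 0.99                            # discount factor (assumed)

def td_update(state, action, reward, next_state):
    """One temporal-difference step: fit the model network's Q-value for the
    taken action toward reward + gamma * max Q from the target network."""
    q_pred = model_net(state)[action]
    with torch.no_grad():
        q_target = reward + gamma * target_net(next_state).max()
    loss = (q_pred - q_target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Periodically: target_net.load_state_dict(model_net.state_dict())
```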

Generalization: state and reward definition

The state and reward definitions are critical to the RL model's ability to generalize to new environments in which the target profile and product parameters differ from those used during training. Specifically, both the state and the reward are defined in terms of the difference between the product temperature and the target-profile temperature, normalized by the maximum range of allowed heater values.

State parameters are defined at the centers of the eight heater zones. Each state parameter is the normalized difference between the product temperature and the target-profile temperature at the center of the corresponding heater zone.
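A minimal sketch of that state computation is below, assuming an illustrative heater range (the default values are not from the paper):

```python
import numpy as np

def compute_state(product_temps, profile_temps, heater_min=0.0, heater_max=350.0):
    """Normalized state vector: for each of the eight heater-zone centers, the
    difference between the product temperature and the target-profile
    temperature, divided by the allowed heater range (range values assumed)."""
    diff = np.asarray(product_temps, dtype=float) - np.asarray(profile_temps, dtype=float)
    return diff / (heater_max - heater_min)
```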

When the agent performs an action, the environment returns a reward indicating the effectiveness of the action in achieving the agent’s goal. The reward is based on whether the action reduced the total temperature difference between the actual temperatures and target profile. The state and reward functions are described in greater detail in the paper.

Results

Following are two of the paper’s test results from running the planning process on various configurations of product materials and temperature-time profiles. All of the tests were run with a model neural network trained with the following product and profile parameters:

Test 1: Baseline

Test 1 is a baseline for testing the model’s performance with the same parameters used to train the model. Following are the test 1 errors, the optimal heat zone settings, and the target profile vs. actual temperature-time plot:

Test 1: Blue trace: target profile. Red trace: actual product profile.

Test 6

Test 6 changes the product from FR4 to aluminum oxide (alumina 99%), changes the size of the product, and changes the profile. The oven parameter values are the same as used in the baseline of test 1, except that both the top and bottom heating elements are active. The following tables reflect the profile and product parameters used for this test (changes from the baseline training parameters are bold):

Following are the test 6 errors, the optimal heat zone settings, and the target profile vs. actual temperature-time plot:

Conclusion

This project demonstrates that a reinforcement learning system can provide solutions to control a complex industrial process. Specifically, a reinforcement learning system successfully learns the optimal control settings of a reflow oven used to solder electronic components to a circuit board. Further, once trained, the system can generalize to produce acceptable results in environments with different requirements from those used during training.

Link to paper: Reinforcement Learning: A Case Study in Model Generalization

All images, unless otherwise noted, are by the author.

