# Autonomous Control

Applications in engineering often require complex and time-consuming numerical computations. Finite element methods (FEM) used for solving Partial Differential Equation (PDE) problems are a typical example. If the PDE problem has some degrees of freedom regarding form or material, it is not always feasible to compute solutions for a number of different choices in order to find the best one for production.

Machine Learning methods can help to provide a recommender system that assembles promising proposals for arbitrary parameters in PDE problems. These proposals can then additionally be refined with just a few classical computations.

Further applications of modern learning techniques lie in the construction of intelligent, automated process control units, which is demonstrated by the following illustrative example.

## Illustrative Example

Take a technical component that consists of a hexagonal part and two tubes of different diameter passing vertically through it. The geometry of this part is shown in the figure below.

When heat runs through the tubes, the whole part gradually heats up as well. The total amount of heat to which the tubes are exposed is constant, but its distribution between the two tubes varies randomly over time.

Assume now that the maximum temperature over all six side faces of the hexagon has to be kept below some threshold in order to maintain the functionality of the part. To achieve this, a cooling unit can be placed on either the left or the right side of the part (as indicated in the plot below). Furthermore, either a low or a high cooling level can be chosen.

The only source of information about the temperature of the part is a measurement unit on the top of the hexagon close to the left tube (marked by the red dot in the plot below).

Figure: The geometry of the part. The cooling unit can be placed on either the left or the right side. The red dot indicates where the observable temperature is measured.

The heat transfer in the part is computed by a finite element approach. The effect of cooling is modeled by a heat flux boundary condition, which is reset at each time step depending on the side and the level of cooling.

## Learning Problem

The control problem described above is an alternation between updating the state (i.e. the temperature) of the part and the action of the controller affecting the state in the next step. The figure below illustrates the conceptual framework.

• The state is determined by the (unobservable) maximum temperature over all side faces and the (observable) temperature measured by the measurement unit on the top.
• The four admissible actions are to place the cooling unit either on the left or the right side, and to choose a low or high level of cooling.
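The state and action spaces described above can be written down compactly. The following is only a minimal sketch; the names `Side`, `Level`, `State`, and `ACTIONS` are hypothetical and do not come from the original implementation.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Side(Enum):
    LEFT = "left"
    RIGHT = "right"


class Level(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class State:
    # Hidden from the controller: maximum temperature over all six side faces.
    max_face_temperature: float
    # Visible to the controller: reading of the measurement unit on the top.
    measured_temperature: float

    def observation(self) -> float:
        """Return the only quantity the controller is allowed to see."""
        return self.measured_temperature


# The four admissible actions: every combination of side and cooling level.
ACTIONS = list(product(Side, Level))
```

Keeping the hidden and the observable temperature in one object makes it explicit that the controller acts on `observation()` only, while the simulation still needs the hidden value to decide whether the part has broken.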

The action of the controller and exogenous effects from the randomly changing heat in the tubes induce a new state in the next time step and a reward returned to the controller. The reward in our example is set to a large negative value when the part breaks due to exceeding the maximum allowed temperature, and to a small positive value depending on the cooling level otherwise. Since cooling generates costs, high cooling yields a lower positive value than low cooling.
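A reward function of this shape might look as follows. The text does not give the actual reward values, so the numbers below are purely illustrative assumptions, as is the function name `reward`.

```python
def reward(part_broken: bool, high_cooling: bool,
           failure_penalty: float = -100.0,   # assumed value, not from the source
           low_cooling_reward: float = 1.0,   # assumed value, not from the source
           high_cooling_reward: float = 0.5   # assumed value, not from the source
           ) -> float:
    """Return the reward for one time step.

    A broken part yields a large negative reward. Otherwise the reward is
    small and positive, and lower for high cooling because cooling costs more.
    """
    if part_broken:
        return failure_penalty
    return high_cooling_reward if high_cooling else low_cooling_reward
```

Only the ordering of the three values matters for the qualitative behavior: failure is far worse than any cooling cost, and low cooling is preferred whenever it is safe.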

An autonomous controller needs to find a set of rules, called a policy, that determines which action (denoted by a) to choose given the current observable state (denoted by s). The optimal policy is the policy that maximizes the expected cumulative reward over time.
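In symbols, this objective can be sketched as follows. The notation is an assumption introduced here for illustration: $\pi$ denotes a policy mapping an observable state $s$ to an action $a$, $r_{t+1}$ is the reward received after the action taken at time step $t$, $T$ is the last time step considered, and $\gamma \in (0, 1]$ is a discount factor that the text does not mention (setting $\gamma = 1$ gives the plain sum of rewards).

```latex
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}\!\left[\, \sum_{t=0}^{T-1} \gamma^{t}\, r_{t+1} \;\middle|\; a_t = \pi(s_t) \right]
```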

## Reinforcement Learning

Reinforcement Learning is an algorithmic concept for learning optimal behavior from experience: a controller (or agent) learns to distinguish between “good” and “bad” actions by trying them and observing the resulting rewards.

This requires the agent to try many different control strategies and leads to one of the major challenges for any learning approach: the “exploitation versus exploration” dilemma. It refers to the objective of improving some given policy by following either of two strategies in choosing the next action: first, choosing the best action identified so far (“exploitation”), or, second, choosing an action that seems to be suboptimal now but might eventually lead to a policy that generates a higher cumulative reward (“exploration”).

Many different concepts exist to resolve this dilemma. The learner used below chooses the (supposedly) optimal action most of the time, but occasionally jumps to a randomly chosen action. This policy is called “ɛ-greedy”, where “ɛ” refers to the probability of jumping to an arbitrary action and “greedy” means always following the action that is assumed to be optimal at the time.
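The ɛ-greedy rule is short enough to sketch in a few lines. The function name `epsilon_greedy` and the representation of the value estimates as a plain list (one entry per action) are assumptions made for illustration.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability epsilon a uniformly random action is chosen
    (exploration); otherwise the action with the highest current
    estimate is chosen (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])
```

With `epsilon = 0` the rule is purely greedy, and with `epsilon = 1` it is purely random. A common design choice, not confirmed by the text for this example, is to decrease ɛ over the course of learning so that exploration fades as the estimates improve.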

Lastly, a measure for “good” and “bad” actions needs to be specified. The common approach is to use the expected future cumulative reward as a measure of how good an action is. This leads to the learning objective of estimating the expected subsequent cumulative reward for every pair of state and action.

Once these values are learned, the optimal policy can easily be derived by being greedy: given a state, choose the action that promises the highest subsequent cumulative reward.
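Deriving the greedy policy from learned values amounts to one lookup per state. The sketch below assumes the values are stored in a nested dictionary `q_table[state][action]`; this layout and the toy numbers in the usage example are hypothetical.

```python
def greedy_policy(q_table):
    """Map each state to the action with the highest estimated value.

    q_table is assumed to be a dict of the form {state: {action: value}}.
    """
    return {state: max(values, key=values.get)
            for state, values in q_table.items()}


# Usage with made-up values for two temperature states:
example_q = {
    "cool": {"left-low": 0.8, "right-low": 1.2, "left-high": 0.4, "right-high": 0.6},
    "hot":  {"left-low": -5.0, "right-low": -7.0, "left-high": 0.3, "right-high": -1.0},
}
example_policy = greedy_policy(example_q)
```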

## Results

A reinforcement learner may need many iterations to converge to an optimal control. A sequence of time steps that ends either in a terminal state with a broken part or after a maximum number of time steps is called an episode. The example used here for demonstration is rather simple, and an efficient behavior was learned after a few hundred episodes using a Reinforcement Learning approach called SARSA.
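SARSA takes its name from the quintuple (state, action, reward, next state, next action) used in each update. The following is a sketch of the standard tabular update rule, not the implementation used for the results here; the learning rate `alpha`, the discount factor `gamma`, and their default values are assumptions.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next,
                 alpha=0.1, gamma=0.95, terminal=False):
    """Apply one tabular SARSA update in place.

    q maps (state, action) pairs to value estimates. The target uses the
    action a_next that the current policy actually selected in s_next,
    which is what makes SARSA an on-policy method. In a terminal state
    (for example a broken part) there is no future reward to add.
    """
    target = r if terminal else r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])


# Usage: unseen (state, action) pairs start at 0.0.
q_values = defaultdict(float)
sarsa_update(q_values, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

Within an episode, the next action would typically be chosen by the same ɛ-greedy rule that generates the behavior, so the learned values describe the policy that is actually being followed, including its occasional random moves.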

The controller has asymmetric information regarding the temperature on the sides of the hexagon, because the measurement unit is placed close to the left tube. The heat transferred by the right tube, mainly to the right half of the hexagon, has to be estimated indirectly.

It is interesting to see how the reinforcement learner has solved this problem: the asymmetry is reflected in the final policy given in the table below.

• As long as the measured temperature is below 91.7% of the maximum allowed temperature, the cooling focuses on the right side, with increased cooling as the temperature gets closer to 91.7%.
• The only exception among these lower states is the range “88.3% – 90.0%”, where low cooling is applied on the left side. This could be interpreted as an attempt to keep the temperature on the left side under control, which allows the agent to continue to focus on the more complicated right side.
• When the temperature rises above 91.7% and gets closer to the threshold, the cooling concentrates exclusively on the left. This makes sense, because the temperatures on the left side are better represented by the observed temperature.
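The policy above is indexed by ranges of the measured temperature, expressed as a percentage of the maximum allowed temperature. The boundaries quoted (88.3%, 90.0%, 91.7%, 98.3%, 100.0%) are consistent with equal-width bins of 1/60 of the threshold. That is an inference from these numbers, not something stated in the text, and the sketch below should be read with that assumption in mind. The function names are hypothetical.

```python
import math

N_BINS = 60  # assumption: bin width of 1/60 of the threshold (about 1.67%)


def temperature_bin(measured, threshold, n_bins=N_BINS):
    """Map a measured temperature to a discrete state index in [0, n_bins - 1].

    Readings at or above the threshold are clamped into the last bin here;
    in the example, exceeding the threshold on a side face would instead
    end the episode.
    """
    index = math.floor(measured / threshold * n_bins)
    return min(max(index, 0), n_bins - 1)


def bin_bounds_percent(index, n_bins=N_BINS):
    """Return the (lower, upper) bounds of a bin as percentages of the threshold."""
    return (100.0 * index / n_bins, 100.0 * (index + 1) / n_bins)
```

Under this assumption, a reading at 89% of the threshold falls into the bin from about 88.3% to 90.0%, which is the range singled out in the second bullet.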

It is also interesting to note that the extreme case of a temperature between 98.3% and 100.0% was never reached in the final episodes, when the learned value function had largely converged. Hence, the learning technique has found a policy that avoids costly cooling on the left.