Reinforcement Learning for Control Systems: An Overview
Reinforcement learning (RL) offers a powerful framework for designing intelligent control systems that can learn and adapt to complex environments. By combining concepts from control theory and machine learning, RL enables the development of autonomous systems that can optimize their behavior in real-time. This article provides an overview of RL in the context of control systems, exploring its fundamental principles, algorithms, and applications.
Introduction to Reinforcement Learning
Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making. Unlike supervised learning, which relies on labeled data, RL involves training an agent through interactions with its environment. The agent learns to maximize a reward signal by selecting optimal actions in each state.
Core Concepts of Reinforcement Learning
The main characters of RL are the agent and the environment. The environment is the world that the agent lives in and interacts with. At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, and then decides on an action to take. The agent also perceives a reward signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called return.
- Agent: The decision-maker that interacts with the environment.
- Environment: The world with which the agent interacts.
- State: A complete description of the state of the world.
- Action: A choice made by the agent.
- Reward: A scalar value that indicates the immediate consequence of an action.
- Policy: A strategy that the agent uses to determine which action to take in each state.
- Value Function: A function that estimates the expected cumulative reward from a given state or state-action pair.
The Reinforcement Learning Process
A basic reinforcement learning agent interacts with its environment in discrete time steps. At each time step t, the agent receives the current state and reward. It then chooses an action from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state and the reward associated with the transition is determined. Formulating the problem as a Markov decision process assumes the agent directly observes the current environmental state; in this case, the problem is said to have full observability. If the agent only has access to a subset of states, or if the observed states are corrupted by noise, the agent is said to have partial observability, and formally the problem must be formulated as a partially observable Markov decision process. In both cases, the set of actions available to the agent can be restricted. When the agent's performance is compared to that of an agent that acts optimally, the difference in performance yields the notion of regret. Because return is accumulated over many steps, reinforcement learning is particularly well-suited to problems that involve a long-term versus short-term reward trade-off.
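This interaction loop can be sketched in a few lines of Python. The environment here is a made-up one-dimensional toy (the state drifts, the reward is the negative distance from zero); it stands in for whatever plant the agent controls:

```python
import random

random.seed(42)  # for reproducibility

class RandomWalkEnv:
    """Toy environment: the agent nudges a scalar state toward 0.
    Reward is -|state|, so staying near 0 maximizes return."""
    def reset(self):
        self.state = random.uniform(-1.0, 1.0)
        return self.state

    def step(self, action):
        self.state += 0.1 * action          # action in {-1, +1}
        reward = -abs(self.state)           # closer to 0 is better
        done = abs(self.state) < 0.05       # "solved" when near the origin
        return self.state, reward, done

env = RandomWalkEnv()
state = env.reset()
total_return = 0.0
for t in range(100):                        # discrete time steps
    action = -1 if state > 0 else 1         # simple policy: move toward 0
    state, reward, done = env.step(action)
    total_return += reward                  # accumulate the return
    if done:
        break
```

The `reset`/`step` interface mirrors the agent-environment cycle described above: observe a state, choose an action, receive a reward and the next state.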
Fundamental Functions in Reinforcement Learning
Reinforcement learning employs two fundamental functions: the policy function and the value function. (Some algorithms, such as Q-learning, use only a value function.)
Policy Function
The policy function is the function the agent uses to decide which action to take in a given state. When the policy function is optimal, the agent takes decisions such that the action chosen in the current state leads to the maximum reward after the sequence of decisions that follows (those later decisions being chosen by the same optimal policy function).
Value Function
The value function, on the other hand, stands for the expected value of the total reward collected from a given state onward. Bellman’s optimality equation is used to model the value function. Here “gamma” is a discount factor in (0, 1) used to weigh future rewards.
Value function of a state s under a policy \(\pi\): \[V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \mid s_0 = s\right]\]
Action-value function for policy \(\pi\): \[Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \mid s_0 = s, a_0 = a\right]\]
Reinforcement learning algorithms are built on the two fundamental equations above (the value function and the action-value function, a.k.a. the Q function). All RL algorithms try to find the optimal policy and value functions that maximize the reward function defined for the problem.
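To make the two functions concrete, here is value iteration on a tiny hypothetical two-state MDP (the states, actions, transitions and rewards are invented for illustration). The value function is iterated with the Bellman optimality equation, and the Q values then follow directly from the converged values:

```python
# Value iteration on an invented 2-state MDP with actions "stay"/"move".
gamma = 0.9                                  # discount factor in (0, 1)

# transitions[s][a] = (next_state, reward); deterministic for simplicity
transitions = {
    0: {"stay": (0, 0.0), "move": (1, 1.0)},
    1: {"stay": (1, 2.0), "move": (0, 0.0)},
}

V = {0: 0.0, 1: 0.0}
for _ in range(200):                         # iterate to (near) convergence
    V = {
        s: max(r + gamma * V[s2] for (s2, r) in transitions[s].values())
        for s in V
    }

# Q(s, a) = r + gamma * V(s') follows from the converged value function
Q = {
    (s, a): r + gamma * V[s2]
    for s in transitions
    for a, (s2, r) in transitions[s].items()
}
```

For this toy problem the values converge to V(1) = 2/(1 − 0.9) = 20 and V(0) = 1 + 0.9 · 20 = 19, and the greedy policy reads straight off the Q table.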
Connection Between Reinforcement Learning and Control Theory
The connection between RL and control theory lies in their shared goal of optimizing system behaviour through decision-making. RL techniques can be viewed as a modern approach to control theory, offering a framework for learning control policies directly from data. By incorporating concepts from control theory, such as state estimation, feedback control, and system dynamics modelling, RL can enhance the stability, performance, and adaptability of control systems.
This synergy between RL and control theory opens up exciting possibilities for creating intelligent and autonomous control systems that can learn and optimize their behaviour in real time.
Control Systems: An Overview
Control systems are a fundamental part of engineering and automation, aimed at regulating and manipulating the behavior of dynamic systems. They consist of a combination of hardware and software components that work together to achieve desired objectives. Control systems operate by continuously measuring system outputs, comparing them to desired setpoints (references), and using this information to compute control signals that manipulate system states (such as the altitude of a plane or the speed of a vehicle).
The primary goal of control systems is to maintain stability, improve performance, and ensure system behaviour conforms to desired specifications. They can be found in various applications, such as robotics, industrial automation, power grids, and transportation systems. Control systems can be designed using different methodologies, including classical control theory, modern control theory, and intelligent control techniques. These systems play a crucial role in enhancing efficiency, safety, and productivity across a wide range of industries and sectors.
PID Controller: A Traditional Control Algorithm
The most famous and widely used control algorithm is the PID (proportional-integral-derivative) controller. With a PID controller, the control engineer must tune the P, I and D coefficients so that the system reaches the desired setpoint while satisfying the following criteria:
- Rise Time
- Overshoot
- Settling Time
- Steady-State Error
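A textbook discrete-time PID loop can be sketched as follows. The gains and the first-order plant are illustrative choices, not tuned for any real system:

```python
class PID:
    """Minimal discrete PID controller (illustrative gains, no anti-windup)."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt                  # I term: accumulated error
        derivative = (error - self.prev_error) / self.dt  # D term: error rate
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Drive an invented first-order plant x' = -x + u toward setpoint 1.0
pid = PID(kp=2.0, ki=1.0, kd=0.1, dt=0.01)
x = 0.0
for _ in range(2000):                                     # 20 simulated seconds
    u = pid.update(setpoint=1.0, measurement=x)
    x += (-x + u) * pid.dt                                # Euler integration
```

The integral term is what removes the steady-state error here; with `ki=0` the plant would settle short of the setpoint.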
In practice, designing a controller with the least rise time, settling time, overshoot and steady-state error all at once is impossible, since each criterion affects the others.
For this reason, a branch called ‘Optimal Control’ formulates the control problem as an optimization problem and solves it with optimal control algorithms like ‘MPC’ (Model Predictive Control) and ‘LQR’ (Linear Quadratic Regulator).
LQR: Minimizing the Cost Function
So let’s first take a look at how LQR minimizes the difference between the system state and its reference:
LQR uses the state-space representation of the dynamical system (the state vector x and the input vector u) together with two matrices, Q and R, that penalize different terms to achieve optimal performance.
The Q matrix penalizes each state’s distance from its reference in the state-space model. For example, an adaptive cruise control (ACC) system has longitudinal states such as velocity, acceleration and jerk. A good ACC system minimizes jerk and reaches the desired velocity with the least amount of acceleration. In this case, the reference state for velocity is the speed that maintains a safe distance, and the reference states for acceleration and jerk are zero.
As you may have realised, designing such a system requires expertise in control engineering and many design parameters grounded in a strong mathematical foundation.
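To keep the idea readable, here is a scalar version of LQR rather than the full matrix form: a one-state plant with invented numbers, scalar q and r weights standing in for the Q and R matrices, and the backward Riccati recursion that yields the optimal feedback gain:

```python
# Scalar discrete-time LQR sketch: plant x_{k+1} = a*x + b*u, with cost
# sum(q*x^2 + r*u^2). The Riccati recursion converges to the optimal
# cost-to-go p, from which the feedback law u = -k*x follows.
a, b = 1.1, 0.5        # an unstable plant (illustrative numbers)
q, r = 1.0, 0.1        # scalar stand-ins for the Q and R matrices

p = q
for _ in range(200):                        # backward Riccati iteration
    k = (b * p * a) / (r + b * b * p)       # optimal feedback gain
    p = q + a * p * (a - b * k)

# Closed-loop simulation: the regulated state decays to zero
x = 1.0
for _ in range(50):
    x = a * x + b * (-k * x)
```

Even though the open-loop plant is unstable (a > 1), the closed-loop factor a − b·k has magnitude below one, so the state is driven to its reference (zero here); larger q drives the error down harder, larger r spends less control effort.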
Bellman Equation
RL, however, is built around the Bellman equation, which maximizes total reward; different algorithms are then used to optimize this equation. The LQR optimality equation and the Bellman equation are therefore very similar: LQR minimizes a cost, while the Bellman equation maximizes a reward!
Defining a reward function that penalizes the difference between the reference and the state of the dynamical system makes it possible to use any kind of RL algorithm on control problems.
Methodologies for Applying Reinforcement Learning to Control Problems
Output Feedback Reinforcement Learning
This is the most basic approach to controlling dynamical systems: directly measure the output and produce the control signal by minimizing the error. In the RL setting, once the algorithm maximizes the negative sum of the absolute error and its integral, it will be able to control the system. Understanding why the negative absolute error is used is very important, so give this concept some time.
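A reward of this form might look like the following sketch; the function name and the weight on the integral term are illustrative choices, not from any particular library:

```python
def tracking_reward(reference, output, error_integral, dt, w_int=0.1):
    """Reward for output-feedback RL: negative absolute tracking error
    plus a penalty on its running integral (w_int is an arbitrary weight)."""
    error = reference - output
    error_integral += abs(error) * dt             # accumulate |error| over time
    reward = -abs(error) - w_int * error_integral
    return reward, error_integral

# One step of the reward computation: output 0.8, reference 1.0
r, I = tracking_reward(reference=1.0, output=0.8, error_integral=0.0, dt=0.1)
```

Because the reward is never positive, the best an agent can do is drive both the instantaneous error and its accumulated integral toward zero, which is exactly the tracking objective.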
State Feedback Reinforcement Learning
State feedback reinforcement learning is similar to output feedback reinforcement learning; the only difference is that it uses a state-space representation of the system and tries to minimize the difference between each state and its reference.
Why Use the Integral of the Negative Absolute Error?
The integral term targets steady-state error: the error that remains once the system has settled at a stable point but there is still a significant difference between the current output (state) and the reference output (state). Penalizing only the instantaneous error allows such an offset to persist; penalizing its accumulated integral forces the agent to eliminate it.
Why Choose RL Over Optimal Control Methods?
In contrast to conventional control methods based on PID controllers, optimal controllers offer a considerable performance increase and reduce deployment time. PID controllers need cascaded schemes in most real-world problems, along with logic to reset the integral error (integrator wind-up) and to manage the derivative term’s sensitivity to sudden changes. All of these problems lead to hand-crafted solutions based on look-up tables and rule-based reset logic.
Reinforcement Learning Algorithms
Several RL algorithms can be applied to control problems, each with its own strengths and weaknesses. Some popular algorithms include:
Q-learning: A model-free RL algorithm that learns the optimal action-value function. It is named after the “quality” function at the heart of the method (and has nothing to do with the deep state or certain members of Congress). In particular, the Q function tells you what the total expected future reward is for choosing an action (A) in a state (S), assuming the best possible actions are chosen from there on out. If you knew this function, writing a policy would be easy: always choose the action with the highest expected reward. Sounds too good to be true, and of course we can’t assume a way to compute it a priori. But the interesting thing is that if we initialize a lookup table of Q values with random numbers, there’s a procedure we can use to iteratively improve it until we get arbitrarily close to the real values.
The key insight is that Q values of successive states are related: if choosing action \(a_i\) in state \(s_i\) leads us to state \(s_{i+1}\) and gives us a reward \(r_i\), then the following holds. \[Q(s_i, a_i) = r_i + \max_{a_{i+1}} Q(s_{i+1}, a_{i+1})\] The \(\max\) is taken over possible actions for state \(s_{i+1}\). It’s how we bake in the notion of choosing the best actions in the future. Technically speaking, it’s only an exact equality if the process isn’t stochastic. Otherwise, perhaps we got unlucky with what state we ended up in, and the real expected reward is higher; or we got lucky and the real expected reward is lower. But if we throw in a learning rate and just push our estimated Q values in the direction suggested by this equation, it turns out that the numbers still converge to the correct values even for stochastic processes (under some assumptions on the learning rate and MDP distributions). In modern usage, the Q function is usually estimated by a deep neural network.
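The iterative procedure can be sketched as tabular Q-learning on a small invented corridor task (five states, the rightmost pays a reward of 1; the task and all hyperparameters are illustrative):

```python
import random

# Tabular Q-learning on an invented corridor: states 0..4, actions
# 0 = left, 1 = right; reaching state 4 ends the episode with reward 1.
random.seed(0)  # for reproducibility
alpha, gamma, epsilon = 0.1, 0.9, 0.2

Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}

def step(s, a):
    s2 = max(0, min(4, s + (1 if a == 1 else -1)))
    reward = 1.0 if s2 == 4 else 0.0
    return s2, reward, s2 == 4

for episode in range(2000):
    s = random.randrange(4)                 # random starts aid exploration
    for _ in range(50):                     # cap episode length
        if random.random() < epsilon:       # epsilon-greedy exploration
            a = random.choice((0, 1))
        else:
            a = max((0, 1), key=lambda a_: Q[(s, a_)])
        s2, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, 0)], Q[(s2, 1)])
        # push Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
        if done:
            break
```

After training, the greedy policy (pick the action with the larger Q value) moves right from every state, and the Q values approach the discounted returns 0.9^k for k steps from the goal.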
Deep Q-Network (DQN): An extension of Q-learning that uses deep neural networks to approximate the action-value function. This is the algorithm that first put DeepMind on the map (back before Google owned them) by learning to play Atari games. As another example, several years ago I trained a deep Q-learning network to play a game inspired by CNC milling. The algorithm is given a target shape and the current workpiece shape. At each time step it chooses a direction to move, and if it moves into a pixel occupied by the workpiece, that material is removed. If it moves into a pixel that’s part of the target shape, that’s a mistake and it loses. It solved small problems perfectly. One of these days I still want to get around to scaling it up.
Policy Gradients: A class of RL algorithms that directly optimize the policy function. With Q learning, once our estimated Q values start to converge, we can easily express our policy in terms of them. Policy gradient methods take a more direct approach, and seek to optimize a policy directly. As with Q learning, these days the policy is almost always represented by a deep net, and we want to train this network with some form of gradient descent.
The first tricky part is that we want to maximize the reward we receive, but we don’t know exactly what the reward function is. We just get samples from it when we perform actions. We certainly can’t compute the gradient of the reward function with respect to the parameters of our neural net. So how can we use gradient descent? It turns out that since we’re optimizing an expected value instead of a single value, we don’t actually need to differentiate through the reward function at all. We just need to differentiate through the policy, which is exactly what backpropagation on our neural network gives us. The math works out like this. Here you can take \(f(x)\) to be the cumulative total reward for some arbitrarily long amount of time, or as is often the case, the cumulative reward when the game ends. Similarly, \(p_\theta(x)\) should be viewed as the probability of the whole series of actions that got us to that particular cumulative reward. \(\theta\) holds whatever we use to parametrize our policy network, so in practice the weights and biases in our neural net. \[\begin{aligned} \nabla_\theta \mathbb{E}_{p_\theta}[f(x)] &= \nabla_\theta \sum_x p_\theta(x) f(x) \\ &= \sum_x \nabla_\theta [p_\theta(x) f(x)] \\ &= \sum_x \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)} p_\theta(x) f(x) \\ &= \sum_x \nabla_\theta [\ln p_\theta(x)] p_\theta(x) f(x) \\ &= \mathbb{E}_{p_\theta}[f(x) \nabla_\theta \ln p_\theta(x)] \end{aligned}\] The second tricky thing is that, in the expression above, \(p_\theta(x)\) represents the probability of the whole chain of actions that gets us to a final cumulative reward. But our neural net just computes the probability for one action. This is where the Markov property comes into play. It means that the probability of ending up in some state depends only on the previous state and the chosen action - there’s no dependence on deeper history.
So the probability of the whole chain of actions factors into the product of the probabilities of each individual choice we made. The key parts of this algorithm were first put together in the REINFORCE algorithm. Since then they’ve been expanded on and improved, including ways to reduce the variance of the gradient estimators, deal with black box functions in the policy computation that you do want to differentiate through, and integrate these principles into other algorithms like Monte Carlo tree search. The most famous example of the latter (policy gradients and MCTS) is AlphaZero, a general algorithm that can train itself from scratch to become the best in the world at chess, go, and many other games without using any historical data.
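The score-function estimator above can be sketched on a toy two-armed bandit (a deliberately tiny stand-in for a policy network: the "network" is just a pair of logits, and arm 1 always pays 1.0 while arm 0 pays nothing):

```python
import math
import random

# REINFORCE on an invented two-armed bandit with a softmax policy over
# logits theta. The update is f(x) * grad log p_theta(x) from the
# derivation above, with f(x) the observed reward.
random.seed(0)  # for reproducibility
theta = [0.0, 0.0]                               # policy parameters (logits)
lr = 0.1

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

for _ in range(500):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1   # sample from the policy
    reward = 1.0 if a == 1 else 0.0              # arm 1 is the good arm
    for i in range(2):
        # grad of log softmax prob of the chosen action: 1[i == a] - probs[i]
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += lr * reward * grad_log       # REINFORCE update
```

Since only arm 1 ever produces a nonzero reward, the updates steadily shift probability mass onto it, and the policy converges to pulling the better arm almost always.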
Actor-Critic Methods: Algorithms that combine policy gradients and value-based methods. Actor-critic algorithms update both the value function and the policy.
Asynchronous Advantage Actor-Critic (A3C): Mnih et al. (2016) present Asynchronous Advantage Actor-Critic (A3C), in which parallel actors employ different exploration policies to stabilize training without relying on experience replay.
Deep Deterministic Policy Gradient (DDPG): Silver et al. (2014) present Deterministic Policy Gradient (DPG) and Lillicrap et al. (2016) extend it to Deep DPG (DDPG).
Trust Region Policy Optimization (TRPO): Schulman et al. (2015) present Trust Region Policy Optimization (TRPO), which constrains each policy update to stay within a trust region of the current policy.
Proximal Policy Optimization (PPO): Schulman et al. (2017) present Proximal Policy Optimization (PPO).
Soft Actor-Critic (SAC): Haarnoja et al. (2018) present Soft Actor-Critic (SAC), an off-policy algorithm aiming to simultaneously succeed at the task and act as randomly as possible.
Twin Delayed Deep Deterministic policy gradient algorithm (TD3): Fujimoto et al. (2018) present Twin Delayed Deep Deterministic policy gradient algorithm (TD3) to minimize the effects of overestimation on both the actor and the critic. Fujimoto and Gu (2021) present a variant of TD3 for offline RL.
Exploration vs. Exploitation
A fundamental dilemma in RL is the exploration vs. exploitation tradeoff. The agent needs to exploit the currently best action to maximize rewards greedily, yet it has to explore the environment to find better actions, when the policy is not optimal yet, or the system is non-stationary. A simple exploration approach is ε-greedy, in which an agent selects a greedy action with probability 1 − ε, and a random action otherwise.
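The ε-greedy rule is a one-liner in practice (the function name and the example values are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest estimated value (exploit).
    q_values: list of estimated action values, indexed by action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# With epsilon=0 the choice is always greedy:
a = epsilon_greedy([0.1, 0.5, 0.3], epsilon=0.0)  # → 1
```

Annealing ε from a large value toward a small one over training is a common way to explore heavily early on and exploit the learned policy later.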
Challenges and Future Directions
While RL offers significant potential for control systems, several challenges remain:
- Sample Efficiency: RL algorithms often require a large amount of data to learn effectively.
- Stability: Training RL agents can be unstable, particularly in complex environments.
- Safety: Ensuring the safety of RL-based control systems is crucial, especially in safety-critical applications.
- Generalization: RL agents may struggle to generalize to new situations or environments.
Future research directions include:
- Offline Reinforcement Learning: Learning from pre-collected data without further interaction with the environment.
- Transfer Learning: Transferring knowledge learned in one environment to another.
- Hierarchical Reinforcement Learning: Decomposing complex tasks into simpler subtasks.
- Safe Reinforcement Learning: Developing algorithms that guarantee safety during training and deployment.

