David Silver's Reinforcement Learning Course: A Comprehensive Overview
David Silver's reinforcement learning course provides a comprehensive introduction to the field, covering fundamental concepts and advanced techniques. This article delves into the core ideas presented in the course, drawing heavily from the lecture materials and expanding upon them for clarity and depth. The article is based on David Silver's lecture videos, and all images used in this post are courtesy of David Silver.
Introduction to Reinforcement Learning
Reinforcement learning (RL) is a paradigm of learning where an agent learns to make decisions in an environment to maximize a cumulative reward. Unlike supervised learning, there is no explicit supervision; the agent learns solely from the feedback it receives in the form of rewards. This feedback may not be instantaneous, and the consequences of an action may only be realized after a significant delay.
Key Characteristics of Reinforcement Learning
Several key characteristics distinguish reinforcement learning from other machine learning paradigms:
- Reward Signal: The agent receives a scalar reward signal as feedback. The agent’s job is to maximise the cumulative reward.
- Delayed Feedback: Feedback may not be instantaneous. That is, the current action may bring positive rewards only after a long time.
- Sequential Decision Making: The decision-making process is sequential, meaning the order of operations and input matters.
- Agent-Environment Interaction: The actions taken by the agent affect subsequent actions and the data it receives.
Rewards and the Reward Hypothesis
A reward is a scalar feedback signal. While richer, more complex reward representations are conceivable, they would only add unnecessary complexity to the algorithm, so rewards are kept scalar. The agent's job is to maximise the cumulative reward.
The reward hypothesis states that all goals can be described by the maximization of expected cumulative rewards. This is a fundamental assumption in reinforcement learning, and it places the onus on the programmer to design a reward function that aligns with the desired behavior.
For example, a helicopter performing stunts can receive a positive reward for following a desired trajectory and a negative reward for crashing or missing the trajectory. Similarly, a game-playing agent can receive a positive reward for increasing the score and a negative reward for decreasing the score.
Sequential Decision Making and Cumulative Reward
The agent shouldn't always act greedily: taking the best reward at the present time step may not lead to the best cumulative reward. A financial investment, for example, incurs an initial loss but may increase the bank balance after many years.
Agent-Environment Interaction: History and State
The agent interacts with the environment by taking actions and receiving observations and rewards. This interaction creates a history, which is the sequence of observations, actions, and rewards.
History: A sequence of observations, actions, and rewards: (H_t = O_1, R_1, A_1, …, A_{t-1}, O_t, R_t)
The state is a function of the history. This means the definition of state is up to the programmer. A state may be defined as the last 4 observations, actions, and rewards; only the most recent observation, action, and reward; the entire history; or any other combination the programmer desires.
State: A function of the history used to determine the next action. (S_t = f(H_t))
Environment State vs. Agent State
The environment state and the agent state are distinct:
- Environment State: The environment's internal representation. Based on it, the environment produces the next observation and reward for the agent.
- Agent State: The agent's internal representation, used by the agent to pick the next action.
Markov States
A state is a Markov state if and only if the next state can be determined purely from the present state. That is, to determine (S_{t+1}), only (S_t) is required; no earlier states are needed.
Markov Property: (P[S_{t+1} | S_t] = P[S_{t+1} | S_1, …, S_t])
However, the definition of the state is up to the programmer. It could be the last 4 observations, rewards, and actions, or any other combination.
Fully and Partially Observable Environments
Environments can be categorized based on the agent's ability to observe the environment state:
- Fully Observable Environments: The agent has access to all relevant information; the agent state is the same as the environment state. Decision-making in such an environment is formalised as a Markov Decision Process.
- Partially Observable Environments: The agent does not have access to all relevant information, so the environment state differs from the agent state. The agent must construct its own state representation. This can be done trivially by using the entire history; a common alternative is a belief state, i.e. a probability distribution over the possible environment states, acting on the most probable one.
For example, while playing poker, a player who can count cards can probabilistically determine the chance of different cards being in different players' hands. Another approach is to use a neural network to predict the next state.
Main Components of an RL Agent
An RL agent typically consists of three main components:
- Policy: Agent’s behaviour function.
- Value Function: Goodness or badness of a state.
- Model: Agent’s representation of the environment.
Policy
The policy defines the agent's behavior. It can be deterministic, outputting a specific action for each state, or stochastic, outputting a probability distribution over actions.
Policy: (\pi(a|s) = P[A_t = a | S_t = s])
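As a minimal sketch in Python (the state and action names here are invented purely for illustration), a deterministic policy is a plain mapping from states to actions, while a stochastic policy maps each state to a distribution over actions:

```python
import random

# Deterministic policy: each state maps directly to one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a probability
# distribution over actions, pi(a|s).
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(policy, state, rng=random):
    """Sample an action from a stochastic policy pi(a|s)."""
    actions, probs = zip(*policy[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]
```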
Value Function
The value function predicts the expected cumulative reward from a given state, following a specific policy. It quantifies the "goodness" or "badness" of a state.
Value Function: (V_\pi(s) = E_\pi[G_t | S_t = s]) where (G_t = R_{t+1} + \gamma R_{t+2} + …)
In the value function, notice that future rewards are multiplied by increasing powers of the gamma parameter. This reduces the impact of distant rewards: immediate rewards matter more than rewards many time steps away, which carry more uncertainty, so the latter are given less importance. This is handled by the discount factor gamma, a value less than 1.
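The effect of the discount factor can be seen in a small sketch (the function name is ours, not from the course):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...

    Each reward k steps ahead is weighted by gamma**k, so
    later rewards contribute less to the return.
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# A reward of 1.0 three steps away is worth only 0.9**3 = 0.729 now.
```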
Model
The model represents the agent's understanding of the environment. It can be used to predict the next state and reward given the current state and action.
- (P_{ss'}^a = P[S_{t+1} = s' | S_t = s, A_t = a]) (transition probabilities)
- (R_s^a = E[R_{t+1} | S_t = s, A_t = a]) (expected reward)
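A minimal tabular model, with hypothetical states, actions, and probabilities chosen only for illustration, might look like:

```python
# Transition probabilities P(s' | s, a) and expected
# immediate rewards R(s, a), stored as dictionaries.
P = {
    ("s0", "go"): {"s0": 0.3, "s1": 0.7},
    ("s1", "go"): {"s1": 1.0},
}
R = {("s0", "go"): 0.0, ("s1", "go"): 1.0}

def predict(state, action):
    """Use the model to predict the next-state distribution
    and the expected immediate reward."""
    return P[(state, action)], R[(state, action)]
```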
Illustrative Examples
Consider a maze example:
- Policy Alone: At each state, the policy shows the direction in which the agent should move.
- Value Function: At each state, the value function shows the goodness/badness of the state. The start state is a bad state as it is far away from the goal, hence a negative value. The agent would move in the direction which gives a higher value.
- Model: Every grid cell (state) shows the immediate reward for that state alone. The model uses these rewards to estimate the cumulative reward and take actions accordingly.
Categorization of RL Agents
RL agents can be categorized based on various factors, including:
- Value-based: Learn a value function.
- Policy-based: Learn a policy directly.
- Model-based: Learn a model of the environment.
- Model-free: Do not learn a model.
Exploration vs. Exploitation
A fundamental challenge in reinforcement learning is the trade-off between exploration and exploitation. The agent must explore the environment to discover new and potentially better actions, but it must also exploit its current knowledge to maximize its cumulative reward.
- Exploration: Discovering new information about the environment.
- Exploitation: Using existing knowledge to maximize reward.
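A classic way to balance the two is an ε-greedy rule: explore with probability ε, exploit otherwise. A minimal sketch (the function and dictionary layout are our own, not the course's):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon take a random action (explore);
    otherwise take the action with the highest estimated value (exploit)."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)           # explore
    return max(actions, key=q_values.get)    # exploit
```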
Markov Decision Processes (MDPs)
A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It provides a formal way to define reinforcement learning problems. Richard Bellman (1920-1984) developed dynamic programming during his time at RAND, a research organisation working with the US government on defence and non-defence matters.
An MDP is defined by a tuple (S, A, P, R, γ) where:
- S is a set of states.
- A is a set of actions.
- R is a reward function.
- P is a transition probability matrix.
- γ is the discount factor.
Markov Reward Process (MRP)
A Markov Reward Process (MRP) is a tuple (S, P, R, γ), where S is a set of states, P the transition matrix, R the reward function, and γ the discount factor.
Policy in MDPs
A policy π is a distribution over actions given a state s. The state-value function is redefined as the expected return in state s when following policy π. The action-value function is defined as the expected return from state s when taking action a and thereafter following policy π.
Bellman Equation for MDPs
Bellman equation for MDPs: the value of state s is the expected immediate reward on leaving state s plus the discounted values of all possible successor states, weighted by their transition probabilities P; actions are factored in through the policy π:
(v_\pi(s) = \sum_a \pi(a|s) ( R_s^a + \gamma \sum_{s'} P_{ss'}^a v_\pi(s') ))
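The Bellman expectation equation can be sketched as a one-state backup over tabular dictionaries (the data layout is an assumption for illustration, not the course's notation):

```python
def bellman_backup(s, pi, P, R, V, gamma=0.9):
    """v(s) = sum_a pi(a|s) * ( R(s,a) + gamma * sum_s' P(s'|s,a) * V(s') )."""
    total = 0.0
    for a, p_a in pi[s].items():
        # Action-value: immediate reward plus discounted successor values.
        q = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
        total += p_a * q
    return total
```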
Optimality
A policy is optimal if its state-value function is at least as good as that of every other policy, for every state. In any MDP there exists an optimal policy that is better than or equal to all other policies; it achieves both the optimal value function and the optimal action-value function.
Optimal Solutions for MDPs
Optimal solutions for MDPs: no closed form solution (in general), but at least four iterative methods: (i) value iteration, (ii) policy iteration, (iii) Q-learning, (iv) Sarsa
Planning by Dynamic Programming
Dynamic programming assumes full knowledge of the MDP. Evaluation: given a policy, can we determine its value function? Control: amongst all policies, what is the best possible policy?
Policy Evaluation
Policy evaluation: start with a value function, then iterate over all states to calculate the new value function in single sweeps using the Bellman equation. Repeat until convergence.
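The sweep-until-convergence procedure can be sketched as follows, assuming a tabular MDP stored in dictionaries (our own layout, chosen for illustration):

```python
def policy_evaluation(states, pi, P, R, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: start from V = 0 and repeatedly
    sweep over all states, applying the Bellman expectation backup,
    until the value function stops changing."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {}
        for s in states:
            V_new[s] = sum(
                p_a * (R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
                for a, p_a in pi[s].items()
            )
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < tol:
            return V
```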
Control with Policy Iteration
Control with policy iteration: evaluate a policy π, then improve the policy by acting greedily with respect to its value function, π' = greedy(v_π). Repeat until the policy stops changing. Examples: 10x10 gridworld, car park transfer example.
Control with Value Iteration
Control with value iteration: iteratively apply the Bellman optimality backup to the value function itself, without maintaining an explicit policy. When the MDP is not fully known, evaluation/prediction must be model-free; the course covers three methods: (i) Monte Carlo, (ii) Temporal-Difference learning, (iii) TD(λ).
Monte Carlo Methods
Monte Carlo: first-visit and every-visit Monte Carlo methods are equally usable; record the return for each episode followed under a policy, then average over the number of visits to get an estimate of the value function in a given state. We can use an incremental mean to update the value function over iterations:
Incremental mean: the latest mean is a function of the latest value and the previous mean, (\mu_k = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})).
This is the inspiration for an update of the value function based on the latest sample return (G_t): (V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t)))
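The incremental-mean idea can be sketched as an every-visit Monte Carlo update (the function name and dictionary layout are assumptions for illustration):

```python
def mc_update(V, counts, state, G):
    """Every-visit incremental Monte Carlo: after observing return G
    from `state`, move V(state) towards G by a step of 1/N(state)."""
    counts[state] = counts.get(state, 0) + 1
    v = V.get(state, 0.0)
    V[state] = v + (G - v) / counts[state]
```

With a constant step size α instead of 1/N, the same update tracks non-stationary problems.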
Temporal-Difference Learning
Temporal-Difference: learns from incomplete episodes, by bootstrapping (updates a guess towards a guess).
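A TD(0) update, which replaces the Monte Carlo return with the bootstrapped target r + γV(s′), can be sketched as (names and layout are ours):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0): V(s) <- V(s) + alpha * (r + gamma*V(s') - V(s)).
    The target bootstraps on the current guess V(s'), so the
    update can be applied mid-episode, after every step."""
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return td_error
```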
Course Outline and Additional Concepts
David Silver's course also touches upon advanced topics such as:
- Trust Region Policy Optimization (TRPO): Combines theoretical ideas from the conservative policy iteration algorithm to prove that monotonic improvement can be guaranteed when one solves a series of subproblems that optimize a bound on the policy performance.
- Q-learning: Convergence result is originally due to Watkins.
- Deep Q-Networks (DQN): Introduced by Mnih et al.
- Approximate Modified Policy Iteration: A very general framework that subsumes many other ADP algorithms as special cases (Scherrer et al.).
- SEARN: Similar to DAGGER/AGGREVATE but using a stochastic policy, and targeted at structured prediction problems.
- Constrained Guided Policy Search (CGPS): Formulates an objective that jointly includes a collection of trajectories and a policy, and encourages them to become consistent.
- Combining the Benefits of Function Approximation and Trajectory Optimization: Jointly optimizing trajectories and policies, with some different design choices from CGPS.
Historical Context
Andrey Markov (1856-1922) developed the technique now known as Markov chains in the early 20th century to examine the pattern of vowels and consonants in Alexander Pushkin's novel Eugene Onegin (First Links in the Markov Chain, Hayes, 2013).