Overview

Markov decision processes (MDPs) are a formalism for complex, fully observable, sequential decision problems in which an agent engages in an extended interaction with the environment.

A Markov decision process has the following data: a state space $\mathcal S$; an action space $\mathcal A$; a transition function $T(s' \mid s, a)$ for how actions in states lead to new states; a reward function $R(s, a, s')$ which has a value corresponding to each transition; and a discount rate $\gamma \in [0, 1)$.
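
As a concrete (illustrative, not canonical) sketch, a small finite MDP can be stored as plain tables; the `MDP` dataclass and the toy two-state example below are assumptions introduced here and reused by the later snippets:

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """Tabular finite MDP: states, actions, transitions, rewards, discount."""
    states: list    # state space S
    actions: list   # action space A
    T: dict         # T[(s, a)] = {s_next: probability}
    R: dict         # R[(s, a, s_next)] = immediate reward (default 0 if absent)
    gamma: float    # discount rate in [0, 1)

# Toy example: "move" switches between states 0 and 1, "stay" stays put;
# moving into state 1 or staying there yields reward 1.
toy = MDP(
    states=[0, 1],
    actions=["stay", "move"],
    T={(0, "stay"): {0: 1.0}, (0, "move"): {1: 1.0},
       (1, "stay"): {1: 1.0}, (1, "move"): {0: 1.0}},
    R={(0, "move", 1): 1.0, (1, "stay", 1): 1.0},
    gamma=0.9,
)
```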

For a given MDP, a policy represents the behavior of a decision-maker by specifying what actions will be taken at each state. Policy prediction or evaluation involves calculating the expected return from a fixed policy, while policy control involves finding the policy with the highest expected return.

Related notes:


Key terms

  • Policy: a formalization of agent behavior as a function mapping states to actions.
  • Prediction: also known as policy evaluation; the computational problem of determining how much long-term reward would be obtained given a policy and an initial state.
  • Value: a standard method of defining long-term reward as the expected discounted sum of rewards over an infinite horizon. In contrast to reward, value is derived from a combination of rewards, the environment, and future behavior.
  • Optimal control: also known as policy optimization; the computational problem of finding the policy with the maximal value function given an MDP. Using the standard definition of value above, there is a unique optimal value function expressed in terms of Bellman optimality equations.
  • Planning: the control problem with a known reward and transition model.
  • Dynamic programming: in this context, a class of algorithms that, given the full state space, calculates the value function via backward induction. wip
  • Value iteration: the dynamic programming algorithm for optimal control.

Policies and value functions

Decision policy, value

Given a Markov decision process $(\mathcal S, \mathcal A, T, R, \gamma)$, a policy $\pi$ represents the behavior of a decision-maker by specifying what actions will be taken at each state. For example, a deterministic policy is a mapping $\pi : \mathcal S \to \mathcal A$, while a stochastic policy maps states to action distributions $\pi(a \mid s)$.

The value $V^\pi(s)$ of a policy $\pi$ from a state $s$ is the expected cumulative, discounted reward that results from following $\pi$.
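
A minimal sketch of both policy types, assuming the tabular `MDP` representation above (the dictionaries and the `sample_action` helper are hypothetical names introduced here):

```python
import random

# Deterministic policy: a mapping from states to actions.
deterministic_pi = {0: "move", 1: "stay"}

# Stochastic policy: a mapping from states to distributions over actions.
stochastic_pi = {
    0: {"move": 0.8, "stay": 0.2},
    1: {"stay": 1.0},
}

def sample_action(pi, s):
    """Sample an action from a stochastic policy pi at state s."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]
```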

Trajectory of an MDP, return

Given a Markov decision process with reward function $R$ and transition function $T$, a single trajectory $\tau$ is a sequence of states, actions, and rewards

$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \dots)$$

where the reward $r_t = R(s_t, a_t, s_{t+1})$ is defined over a state/action/next-state tuple. The return $G(\tau)$ associated with the trajectory is the discounted cumulative reward

$$G(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$$

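For example, given the list of rewards along a sampled trajectory, the return is just the discounted sum (a small sketch; `discounted_return` is a name introduced here):

```python
def discounted_return(rewards, gamma):
    """G = sum_t gamma^t * r_t for a finite list of rewards in time order."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# e.g. rewards [0, 0, 1] with gamma = 0.9 give 0 + 0 + 0.81 = 0.81
print(discounted_return([0, 0, 1], 0.9))
```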

Policy prediction

The policy prediction problem is to calculate the state value function $V^\pi(s)$, or expected return, from every state $s$ that results from following a fixed policy $\pi$.

Equation: State value

The value of a state $s$ under policy $\pi$ is given by the expected return (i.e., cumulative discounted reward) conditioned on starting at state $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, \pi \right]$$

where $r_t = R(s_t, a_t, s_{t+1})$ is the reward at $s_t$ with action $a_t$ given by the policy $\pi$ and next state $s_{t+1}$ given by transition dynamics $T$, and $\gamma$ is the discount rate.

The Monte Carlo policy prediction algorithm calculates $V^\pi(s)$ by repeatedly sampling trajectories from $s$ and averaging their returns.

Algorithm: Monte Carlo policy prediction

Input: A policy $\pi$ to evaluate and a state $s$ to evaluate from.

  • Initialize the return $G \leftarrow 0$ and timestep $t \leftarrow 0$, starting from state $s_0 = s$
  • While $s_t$ is not an absorbing state (a state that always leads to itself regardless of what action is taken, and always returns a reward of $0$):
    • Sample action $a_t \sim \pi(\cdot \mid s_t)$
    • Sample next state $s_{t+1} \sim T(\cdot \mid s_t, a_t)$
    • Calculate reward $r_t = R(s_t, a_t, s_{t+1})$
    • Update $G \leftarrow G + \gamma^t r_t$
    • Update $t \leftarrow t + 1$
  • Return $G$
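
A Python sketch of this procedure, under the tabular `MDP` and `sample_action` assumptions above; since the toy example has no absorbing state, a `max_steps` cap (not part of the algorithm) truncates each trajectory, and averaging many sampled returns estimates $V^\pi(s)$:

```python
import random

def is_absorbing(mdp, s):
    """True if every action self-loops at s with reward 0."""
    return all(mdp.T[(s, a)].get(s, 0.0) == 1.0 and mdp.R.get((s, a, s), 0.0) == 0.0
               for a in mdp.actions)

def sample_return(mdp, pi, s, max_steps=1000):
    """Sample one trajectory from s under stochastic policy pi; accumulate its discounted return G."""
    G, t = 0.0, 0
    while not is_absorbing(mdp, s) and t < max_steps:
        a = sample_action(pi, s)                                      # a_t ~ pi(. | s_t)
        next_states, probs = zip(*mdp.T[(s, a)].items())
        s_next = random.choices(next_states, weights=probs, k=1)[0]   # s_{t+1} ~ T(. | s_t, a_t)
        r = mdp.R.get((s, a, s_next), 0.0)                            # r_t = R(s_t, a_t, s_{t+1})
        G += mdp.gamma**t * r
        t += 1
        s = s_next
    return G

def mc_policy_prediction(mdp, pi, s, n_trajectories=1000):
    """Estimate V^pi(s) by averaging returns of sampled trajectories."""
    return sum(sample_return(mdp, pi, s) for _ in range(n_trajectories)) / n_trajectories
```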

On the other hand, discounted infinite-horizon MDPs have the property that $V^\pi$ can be expressed with a set of recursive equations. This allows for calculating exact values using dynamic programming: the general idea is to initialize the value function arbitrarily, then use the right-hand expression to calculate updated value functions until the function stops changing and Bellman’s equations are satisfied.

Equation: Bellman’s equation

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} T(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$

Algorithm: Bellman policy prediction

Input: A policy $\pi$, an MDP $(\mathcal S, \mathcal A, T, R, \gamma)$, and a convergence precision $\epsilon$.

  • Initialize $V_0(s) \leftarrow 0$ for all $s \in \mathcal S$
  • For $k$ in $1, 2, \dots$:
    • For $s \in \mathcal S$:
      • If $s$ is an absorbing state:
        • Skip
      • Else:
        • Update $V_k(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s'} T(s' \mid s, a) \left[ R(s, a, s') + \gamma V_{k-1}(s') \right]$
    • If $\max_{s \in \mathcal S} \lvert V_k(s) - V_{k-1}(s) \rvert < \epsilon$:
      • Break
  • Return $V_k$
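
A tabular sketch of this procedure under the same assumptions; each sweep applies the Bellman backup to every state and stops once successive value functions agree to within $\epsilon$:

```python
def bellman_policy_prediction(mdp, pi, epsilon=1e-8):
    """Iterative policy evaluation of a stochastic policy pi with full model access."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        # One Bellman backup per state: expectation over the policy's actions
        # and the transition dynamics of reward plus discounted next-state value.
        V_new = {
            s: sum(
                p_a * p_s * (mdp.R.get((s, a, s_next), 0.0) + mdp.gamma * V[s_next])
                for a, p_a in pi[s].items()
                for s_next, p_s in mdp.T[(s, a)].items()
            )
            for s in mdp.states
        }
        if max(abs(V_new[s] - V[s]) for s in mdp.states) < epsilon:
            return V_new
        V = V_new
```

For the toy example above, `bellman_policy_prediction(toy, stochastic_pi)` should roughly match the Monte Carlo estimate, up to sampling error.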

Policy control

Policy control is the problem of calculating the optimal policy, which can again be computed recursively. Rather than selecting an action according to a policy $\pi$, the Bellman optimality equations express selecting the best action in terms of state values. Like before, we have an iterative algorithm that converges on a $V^*$ satisfying these equations.

Equation: Bellman optimality equations

$$V^*(s) = \max_{a} \sum_{s'} T(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]$$

Algorithm: Value iteration

Input: An MDP $(\mathcal S, \mathcal A, T, R, \gamma)$ and a convergence precision $\epsilon$.

  • Initialize $V_0(s) \leftarrow 0$ for all $s \in \mathcal S$.
  • For $k$ in $1, 2, \dots$:
    • For $s \in \mathcal S$:
      • If $s$ is absorbing:
        • Skip
      • Else:
        • Update $V_k(s) \leftarrow \max_{a} \sum_{s'} T(s' \mid s, a) \left[ R(s, a, s') + \gamma V_{k-1}(s') \right]$
    • If $\max_{s \in \mathcal S} \lvert V_k(s) - V_{k-1}(s) \rvert < \epsilon$:
      • Break
  • Return $V_k$
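
A corresponding sketch of value iteration under the same assumptions; the only change from policy evaluation is that the backup takes a max over actions instead of an expectation under $\pi$:

```python
def value_iteration(mdp, epsilon=1e-8):
    """Compute the optimal state value function V* by repeated Bellman optimality backups."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        # Optimality backup: best action's expected reward plus discounted next-state value.
        V_new = {
            s: max(
                sum(p * (mdp.R.get((s, a, s_next), 0.0) + mdp.gamma * V[s_next])
                    for s_next, p in mdp.T[(s, a)].items())
                for a in mdp.actions
            )
            for s in mdp.states
        }
        if max(abs(V_new[s] - V[s]) for s in mdp.states) < epsilon:
            return V_new
        V = V_new
```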

Given $V^*$, the optimal policy can be recovered by calculating the optimal state-action values for each action in a state, then taking the best action(s). In particular, we have the $Q$-value (for “quality”) of an action $a$ taken at a state $s$ assuming that we will act optimally from then on.

Equation: $Q$-value

$$Q^*(s, a) = \sum_{s'} T(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]$$

The optimal stochastic policy $\pi^*$ is then any policy that is greedy with respect to the optimal $Q$-values, meaning it is uniform over all actions equal to the maximum value¹:

$$\pi^*(a \mid s) = \frac{\mathbf{1}\left[ Q^*(s, a) = \max_{a'} Q^*(s, a') \right]}{\sum_{a''} \mathbf{1}\left[ Q^*(s, a'') = \max_{a'} Q^*(s, a') \right]}$$

where $\mathbf{1}[x] = 1$ if $x$ is true and $0$ otherwise, and the denominator thus normalizes $\pi^*(\cdot \mid s)$ to a probability distribution.
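
A sketch of this extraction step under the same tabular assumptions: compute $Q^*$ from $V^*$ (e.g. the output of `value_iteration` above), then put uniform probability on the argmax actions:

```python
def q_values(mdp, V):
    """Q*(s, a) computed from optimal state values V*."""
    return {
        (s, a): sum(p * (mdp.R.get((s, a, s_next), 0.0) + mdp.gamma * V[s_next])
                    for s_next, p in mdp.T[(s, a)].items())
        for s in mdp.states
        for a in mdp.actions
    }

def greedy_policy(mdp, Q, tol=1e-9):
    """Optimal stochastic policy: uniform over the maximizing actions at each state."""
    pi = {}
    for s in mdp.states:
        best = max(Q[(s, a)] for a in mdp.actions)
        argmax = [a for a in mdp.actions if Q[(s, a)] >= best - tol]
        pi[s] = {a: 1.0 / len(argmax) for a in argmax}
    return pi

# Example: pi_star = greedy_policy(toy, q_values(toy, value_iteration(toy)))
```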


Variations

Partially observable MDPs

Stochastic games

@2016hoFeaturebased

Multi-agent MDPs

@2020carrollUtility

Multi-agent MDPs with sub-tasks

@2021wuToo

Meta-level MDP

wip @2024griffithsBayesian 331


Code snippets

(\mathcal S, \mathcal A, T, R, \gamma)

Notes

  • @2024griffithsBayesian policy iteration? p. 208
  • The MDP framework differs from a full reinforcement learning setting in that, in the MDP framework, we have general access to the underlying transition probabilities and reward function, and we can examine any state at any time with dynamic programming instead of sampling in sequence. Specifically, in the MDP framework, rewards are given while values are computed or learned; in contrast, the reinforcement learning problem involves trying to find the optimal policy for an MDP in the absence of information about how actions modify states, or what costs or rewards result from an action.

Footnotes

  1. For example, if there is only one maximum value, then the policy is deterministic.