Overview

Markov decision processes (MDPs) are a formalism for fully observable, sequential decision problems in which an agent engages in an extended interaction with its environment. In other words, an MDP is a reinforcement learning task that satisfies the Markov property.

A Markov decision process has the following data: a state space $\mathcal S$; an action space $\mathcal A$; a transition function $T(s' \mid s, a)$ specifying how actions taken in states lead to new states; a reward function $R(s, a, s')$ assigning a value to each transition; and a discount rate $\gamma \in [0, 1]$.
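As a concrete sketch (a toy example, not taken from any source), an MDP with finite state and action spaces can be written down directly as arrays; the sizes and numbers below are made up for illustration:

```python
import numpy as np

# Toy tabular MDP (S, A, T, R, gamma); all values are illustrative.
n_states, n_actions = 3, 2

# T[s, a, s'] = probability of landing in state s' after taking action a in state s.
T = np.zeros((n_states, n_actions, n_states))
T[0, 0] = [0.9, 0.1, 0.0]
T[0, 1] = [0.0, 0.8, 0.2]
T[1, 0] = [0.0, 0.9, 0.1]
T[1, 1] = [0.0, 0.0, 1.0]
T[2, :] = [0.0, 0.0, 1.0]   # state 2 is absorbing

# R[s, a, s'] = reward for the transition (s, a, s').
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 2] = 1.0            # transitions into state 2 pay +1

gamma = 0.95                # discount rate
```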

For a given MDP, a policy represents the behavior of a decision-maker by specifying which actions will be taken at each state.¹

Related notes:


Decision policies for actions

Decision policy

Given a Markov decision process $(\mathcal S, \mathcal A, T, R, \gamma)$, a policy $\pi$ is a mapping $\pi(a \mid s)$, where $\pi(a \mid s)$ is the probability of taking action $a$ when in state $s$. For example, a deterministic policy is a mapping $\pi : \mathcal S \to \mathcal A$, while a stochastic policy maps states to action distributions, $\pi : \mathcal S \to \Delta(\mathcal A)$.
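Continuing the toy arrays above, one possible encoding (purely illustrative) stores a deterministic policy as a vector of action indices and a stochastic policy as a row-stochastic matrix:

```python
# Deterministic policy: pi_det[s] is the single action taken in state s.
pi_det = np.array([1, 0, 0])

# Stochastic policy: pi_stoch[s, a] is the probability of taking action a in state s.
pi_stoch = np.array([
    [0.3, 0.7],
    [1.0, 0.0],
    [0.5, 0.5],
])
assert np.allclose(pi_stoch.sum(axis=1), 1.0)  # each row is a distribution over actions
```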

Trajectory of an MDP, return

Given a Markov decision process with reward function $R$ and transition function $T$, a single trajectory $\tau$ is a sequence of states, actions, and rewards

$$\tau = (s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots),$$

where the reward $r_{t+1} = R(s_t, a_t, s_{t+1})$ is defined over a state/action/next-state tuple. The return associated with the trajectory is the discounted cumulative reward

$$G(\tau) = \sum_{t=0}^{\infty} \gamma^t\, r_{t+1}.$$

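A minimal sketch (function names are made up) of sampling a trajectory under the stochastic policy above and computing its discounted return:

```python
rng = np.random.default_rng(0)

def sample_trajectory(T, R, pi, s0=0, horizon=50):
    """Roll out policy pi[s, a], returning a list of (state, action, reward) steps."""
    traj, s = [], s0
    for _ in range(horizon):
        a = rng.choice(T.shape[1], p=pi[s])
        s_next = rng.choice(T.shape[2], p=T[s, a])
        traj.append((s, a, R[s, a, s_next]))
        s = s_next
    return traj

def discounted_return(traj, gamma):
    """G(tau) = sum_t gamma^t * r_{t+1} over the sampled trajectory."""
    return sum(gamma**t * r for t, (_, _, r) in enumerate(traj))

traj = sample_trajectory(T, R, pi_stoch)
print(discounted_return(traj, gamma))
```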

State reward function

In some cases, such as in the literature on the successor representation, the state reward function associated with following policy $\pi$ is denoted $r^\pi(s)$, where $r^\pi(s)$ is precisely the expected immediate reward received when in state $s$ and taking actions $a$ with probability $\pi(a \mid s)$.

The expression for $r^\pi(s)$ depends on the form of the reward function, as well as on whether the policy is deterministic or stochastic:

| Reward function form | Deterministic policy $\pi(s)$ | Stochastic policy $\pi(a \mid s)$ |
| --- | --- | --- |
| $R(s)$ | $r^\pi(s) = R(s)$ | $r^\pi(s) = R(s)$ |
| $R(s, a)$ | $r^\pi(s) = R(s, \pi(s))$ | $r^\pi(s) = \sum_a \pi(a \mid s)\, R(s, a)$ |
| $R(s, a, s')$ | $r^\pi(s) = \sum_{s'} T(s' \mid s, \pi(s))\, R(s, \pi(s), s')$ | $r^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} T(s' \mid s, a)\, R(s, a, s')$ |
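The $R(s, a, s')$ row of the table can be checked numerically against the toy arrays above; the helper names here are made up for illustration:

```python
def state_reward_deterministic(T, R, pi_det):
    """r_pi(s) = sum_{s'} T(s' | s, pi(s)) R(s, pi(s), s') for a deterministic policy."""
    return np.array([T[s, pi_det[s]] @ R[s, pi_det[s]] for s in range(T.shape[0])])

def state_reward_stochastic(T, R, pi_stoch):
    """r_pi(s) = sum_a pi(a | s) sum_{s'} T(s' | s, a) R(s, a, s') for a stochastic policy."""
    expected_sa = np.einsum("sap,sap->sa", T, R)   # expected immediate reward per (s, a)
    return np.einsum("sa,sa->s", pi_stoch, expected_sa)

r_pi = state_reward_stochastic(T, R, pi_stoch)     # shape (n_states,)
```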

Variations

|  | Single agent | Multi-agent |
| --- | --- | --- |
| Fully observable | Markov decision process (MDP) | Stochastic game |
| Partially observable | Partially observable Markov decision process (POMDP) | Partially observable stochastic game? |

Partially observable MDPs

Multi-agent MDPs with sub-tasks

@2021wuToo

Meta-level MDP

wip @2024griffithsBayesian 331


Code snippets

(\mathcal S, \mathcal A, T, R, \gamma)

Notes

  • @2024griffithsBayesian policy iteration? p. 208
  • The MDP framework differs from a full reinforcement learning setting in that, in the MDP framework, we have direct access to the underlying transition probabilities and reward function, and we can examine any state at any time with dynamic programming instead of sampling states in sequence. Specifically, in the MDP framework, rewards are given while values are computed or learned; in contrast, the reinforcement learning problem involves finding the optimal policy for an MDP in the absence of information about how actions modify states, or what costs or rewards result from an action. (See the policy-evaluation sketch after this list.)
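A minimal sketch of what computing values by dynamic programming looks like when $T$ and $R$ are fully known, here via iterative policy evaluation (continuing the toy arrays above; an illustration, not a reference implementation):

```python
def evaluate_policy(T, R, pi_stoch, gamma, tol=1e-8):
    """Iterative policy evaluation, assuming full access to T and R (no sampling needed)."""
    expected_sa = np.einsum("sap,sap->sa", T, R)    # expected immediate reward per (s, a)
    r_pi = np.einsum("sa,sa->s", pi_stoch, expected_sa)
    T_pi = np.einsum("sa,sap->sp", pi_stoch, T)     # state-to-state transitions under pi
    V = np.zeros(T.shape[0])
    while True:
        V_new = r_pi + gamma * T_pi @ V             # Bellman expectation backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```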

Footnotes

  1. In the language of more general sequential decision problems (e.g., in @2025icardResource), this is also known as a stationary strategy: a map from histories to action distributions that depends only on the current state.