Overview
An alternative to both model-free and model-based representations of a reinforcement learning environment is the successor representation, which encodes for each state a discounted "occupancy count" of the other states the agent expects to visit when following some policy.
Successor representation
The successor representation for a policy $\pi$ is defined for all pairs $(s, s')$, where $s$ is a current state and $s'$ is a future state, by
$$M^\pi(s, s') = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t \, \mathbb{1}[S_t = s'] \,\middle|\, S_0 = s\right],$$
where $\mathbb{1}[S_t = s']$ is the indicator function that returns $1$ when the state at time $t$ is $s'$ and $0$ otherwise.
- Note that the indicator function plays the role of a reward function that pays $1$ only in state $s'$; under that reward, $M^\pi(\cdot, s')$ is the corresponding value function.
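As a concrete illustration of the definition, the sketch below estimates the SR of a small Markov chain by Monte Carlo, averaging discounted visit counts over rollouts. The transition matrix `P_pi`, the discount `gamma`, and the rollout parameters are illustrative assumptions, not part of the source.

```python
import numpy as np

# Minimal sketch: estimate M^pi(s, s') directly from the definition, as an
# expected discounted count of visits to s' when starting from s.
# The 3-state chain P_pi and gamma below are illustrative assumptions.
rng = np.random.default_rng(0)
gamma = 0.9
P_pi = np.array([[0.1, 0.9, 0.0],   # row s: P(next state | s) under policy pi
                 [0.0, 0.1, 0.9],
                 [0.9, 0.0, 0.1]])
n_states = P_pi.shape[0]

def estimate_sr(s0, n_rollouts=2000, horizon=200):
    """Average discounted occupancy of each state over truncated rollouts from s0."""
    m = np.zeros(n_states)
    for _ in range(n_rollouts):
        s = s0
        for t in range(horizon):
            m[s] += gamma ** t          # adds gamma^t * 1[S_t = s'] for every s'
            s = rng.choice(n_states, p=P_pi[s])
    return m / n_rollouts

M_hat = np.stack([estimate_sr(s) for s in range(n_states)])
print(M_hat)   # row s is an estimate of M^pi(s, .); each row sums to ~ 1 / (1 - gamma)
```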
Recursive equations for SR
Equation: Recursive equations for SR
For any policy $\pi$, the successor representation for a current state $s$ and future state $s'$ can be computed by
$$M^\pi(s, s') = \mathbb{1}[s = s'] + \gamma \sum_{a} \pi(a \mid s) \sum_{\tilde{s}} P(\tilde{s} \mid s, a)\, M^\pi(\tilde{s}, s').$$
If $\pi$ is a deterministic policy, we have
$$M^\pi(s, s') = \mathbb{1}[s = s'] + \gamma \sum_{\tilde{s}} P(\tilde{s} \mid s, \pi(s))\, M^\pi(\tilde{s}, s').$$
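Stacking the recursive equations over all state pairs gives a linear system: in matrix form, $M^\pi = I + \gamma P^\pi M^\pi$, so $M^\pi = (I - \gamma P^\pi)^{-1}$. The sketch below solves the recursion both ways; the transition matrix and discount are the same illustrative assumptions used above.

```python
import numpy as np

# Minimal sketch: solve the recursive SR equations in matrix form,
# M = I + gamma * P_pi @ M, either in closed form or by fixed-point iteration.
# P_pi (state-to-state transitions under pi) and gamma are assumed as above.
gamma = 0.9
P_pi = np.array([[0.1, 0.9, 0.0],
                 [0.0, 0.1, 0.9],
                 [0.9, 0.0, 0.1]])
I = np.eye(P_pi.shape[0])

# Closed form: M = (I - gamma * P_pi)^(-1)
M_closed = np.linalg.inv(I - gamma * P_pi)

# Fixed-point iteration of the recursion M <- I + gamma * P_pi @ M
M_iter = np.zeros_like(P_pi)
for _ in range(500):
    M_iter = I + gamma * P_pi @ M_iter

print(np.allclose(M_closed, M_iter))  # True
```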
Computing state value from the SR
Expressing the value of a state using SR and state reward
Let $\pi$ be a policy for acting in a Markov decision process with reward function $R(s, a)$, and let
$$r^\pi(s) = \sum_{a} \pi(a \mid s)\, R(s, a)$$
be the state reward function associated with $\pi$. Then the state value function $V^\pi$ can be computed as a linear combination of the successor representation and state reward function:
$$V^\pi(s) = \sum_{s'} M^\pi(s, s')\, r^\pi(s').$$
Proof adapted from @2024griffithsBayesian, p. 214. We have
$$\begin{aligned}
\sum_{s'} M^\pi(s, s')\, r^\pi(s')
&= \sum_{s'} r^\pi(s')\, \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t\, \mathbb{1}[S_t = s'] \,\middle|\, S_0 = s\right] \\
&= \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t \sum_{s'} \mathbb{1}[S_t = s']\, r^\pi(s') \,\middle|\, S_0 = s\right] \\
&= \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r^\pi(S_t) \,\middle|\, S_0 = s\right] \\
&= r^\pi(s) + \gamma\, \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r^\pi(S_{t+1}) \,\middle|\, S_0 = s\right] \\
&= r^\pi(s) + \gamma \sum_{\tilde{s}} P^\pi(\tilde{s} \mid s)\, V^\pi(\tilde{s}) = V^\pi(s),
\end{aligned}$$
where $P^\pi(\tilde{s} \mid s) = \sum_a \pi(a \mid s)\, P(\tilde{s} \mid s, a)$ is the state-to-state transition probability under $\pi$, the fourth line expresses the expected return as a sum of expected immediate reward and expected discounted future value, and the final equality gives the familiar Bellman equation for state value.
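The identity is easy to check numerically: compute $V^\pi$ as the SR-weighted sum of state rewards and compare it with a direct solve of the Bellman equation. The chain, discount, and state reward vector below are illustrative assumptions carried over from the earlier sketches.

```python
import numpy as np

# Minimal sketch: V^pi(s) = sum_{s'} M^pi(s, s') * r^pi(s'), checked against a
# direct Bellman solve. P_pi, gamma, and r_pi are illustrative assumptions.
gamma = 0.9
P_pi = np.array([[0.1, 0.9, 0.0],
                 [0.0, 0.1, 0.9],
                 [0.9, 0.0, 0.1]])
r_pi = np.array([0.0, 1.0, 0.0])              # state reward function r^pi(s)

M = np.linalg.inv(np.eye(3) - gamma * P_pi)   # successor representation
V_from_sr = M @ r_pi                          # linear combination of SR and rewards

# Direct Bellman solve: V = r + gamma * P_pi V  =>  (I - gamma * P_pi) V = r
V_bellman = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

print(np.allclose(V_from_sr, V_bellman))      # True
```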