Overview
In a decision problem (such as a bandit or sequential decision problem), a decision policy is a mechanism for selecting actions; for instance, a stochastic decision policy assigns to each valid action a probability of being chosen. The two most commonly used decision policies are ($\epsilon$-)greedy and softmax.
While learning allows agents to find the best actions, decision policies address the exploration-exploitation dilemma: whether to exploit actions that led to high reward in the past (i.e., act greedily), or to explore in the hope of achieving better results in the future.
Greedy policy
Greedy policies select the action(s) with highest estimated value.
$\epsilon$-greedy decision policy
Given some $\epsilon \in [0, 1]$, an $\epsilon$-greedy policy selects the action(s) with highest estimated value with probability $1 - \epsilon$, and a random action with probability $\epsilon$.
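As a concrete illustration, here is a minimal Python sketch of an $\epsilon$-greedy selection rule; the names (`epsilon_greedy`, `values`) are illustrative, not from any particular library:

```python
import random

def epsilon_greedy(values, epsilon, rng=random):
    """Pick an action index from `values` with an epsilon-greedy rule.

    With probability epsilon, return an action chosen uniformly at
    random; otherwise return a greedy action (highest estimated value,
    ties broken at random). Names here are hypothetical, for
    illustration only.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    best = max(values)
    return rng.choice([i for i, v in enumerate(values) if v == best])

# Example: estimated action values for a 3-armed bandit.
q = [0.2, 0.8, 0.5]
print(epsilon_greedy(q, epsilon=0.1))  # usually 1, occasionally random
```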
Softmax decision policy
Let $\mathcal{A} = \{a_1, \ldots, a_n\}$ be a set of actions such that each $a_i$ is associated with a “value” $q_t(a_i)$. Then at time $t$, the softmax (a.k.a. Boltzmann, Gibbs) decision policy selects $a_i$ with probability

$$P_t(a_i) = \frac{e^{\beta q_t(a_i)}}{\sum_{j=1}^{n} e^{\beta q_t(a_j)}},$$

where $\beta \geq 0$ is called the inverse temperature parameter.
The inverse temperature controls the balance between exploration and exploitation: as $\beta \to 0$, the softmax rule approaches complete randomization, i.e., a uniform distribution over actions (high exploration), while as $\beta \to \infty$, the softmax rule approaches deterministic selection of the maximum-value action (i.e., high exploitation).
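The following Python sketch illustrates this behavior; the name `softmax_policy` is hypothetical, and subtracting the maximum value before exponentiating is a standard numerical-stability trick that leaves the probabilities unchanged:

```python
import math

def softmax_policy(values, beta):
    """Return softmax (Boltzmann) choice probabilities over `values`.

    `beta` is the inverse temperature: beta -> 0 yields a near-uniform
    distribution, large beta concentrates mass on the max-value action.
    Subtracting max(values) before exponentiating avoids overflow
    without changing the resulting probabilities.
    """
    m = max(values)
    exps = [math.exp(beta * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

q = [0.2, 0.8, 0.5]
print(softmax_policy(q, beta=0.1))   # near-uniform: high exploration
print(softmax_policy(q, beta=20.0))  # near-greedy: high exploitation
```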