Overview
In a decision problem (such as a bandit or sequential decision problem), a decision policy is a mechanism for selecting actions; for instance, a stochastic decision policy assigns to each valid action a probability of being chosen. The two most commonly used decision policies are ($\epsilon$-)greedy and softmax.
While learning allows agents to find the best actions, decision policies address the exploration-exploitation dilemma: whether to exploit actions that led to high reward in the past (i.e., act greedily), or to explore in the hope of achieving better results in the future.
Greedy policy
Greedy policies select the action(s) with highest estimated value.
$\epsilon$-greedy decision policy
Given some $\epsilon \in [0, 1]$, an $\epsilon$-greedy policy selects the action(s) with highest estimated value with probability $1 - \epsilon$, and a random action with probability $\epsilon$.
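As a concrete illustration, here is a minimal Python sketch of an $\epsilon$-greedy selection rule; the names (`epsilon_greedy`, `values`) are illustrative, not from any particular library:

```python
import random

def epsilon_greedy(values, epsilon, rng=random):
    """Pick an action index from `values` with an epsilon-greedy rule.

    With probability epsilon, return an action chosen uniformly at
    random; otherwise return a greedy action (highest estimated value,
    ties broken at random). Names here are hypothetical, for
    illustration only.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    best = max(values)
    return rng.choice([i for i, v in enumerate(values) if v == best])

# Example: estimated action values for a 3-armed bandit.
q = [0.2, 0.8, 0.5]
print(epsilon_greedy(q, epsilon=0.1))  # usually 1, occasionally random
```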
Softmax decision policy
Let $\mathcal{A} = \{a_1, \ldots, a_n\}$ be a set of actions such that each $a_i$ is associated with a “value” $q_t(a_i)$. Then at time $t$, the softmax (a.k.a. Boltzmann, Gibbs) decision policy selects $a_i$ with probability

$$P_t(a_i) = \frac{e^{\beta q_t(a_i)}}{\sum_{j=1}^{n} e^{\beta q_t(a_j)}},$$

where $\beta \geq 0$ is called the inverse temperature parameter.
The inverse temperature controls the balance between exploration and exploitation: as $\beta \to 0$, the softmax rule approaches complete randomization, i.e., a uniform distribution over actions (high exploration), while as $\beta \to \infty$, the softmax rule approaches deterministic selection of the maximum-value action (i.e., high exploitation).
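The following Python sketch illustrates this behavior; the name `softmax_policy` is hypothetical, and subtracting the maximum value before exponentiating is a standard numerical-stability trick that leaves the probabilities unchanged:

```python
import math

def softmax_policy(values, beta):
    """Return softmax (Boltzmann) choice probabilities over `values`.

    `beta` is the inverse temperature: beta -> 0 yields a near-uniform
    distribution, large beta concentrates mass on the max-value action.
    Subtracting max(values) before exponentiating avoids overflow
    without changing the resulting probabilities.
    """
    m = max(values)
    exps = [math.exp(beta * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

q = [0.2, 0.8, 0.5]
print(softmax_policy(q, beta=0.1))   # near-uniform: high exploration
print(softmax_policy(q, beta=20.0))  # near-greedy: high exploitation
```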