Overview

Reinforcement learning (RL) is a method of learning from rewards and punishments rather than from explicit instructions. Specifically, the agent must learn a behavioral policy, a mapping from states to actions, that maximizes cumulative long-term reward. The agent acts on its environment, balancing exploration of untried actions against exploitation of actions it already values, and learns from the resulting rewards.
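
A minimal sketch of this loop in Python, using a hypothetical two-armed bandit in place of a full environment (the arm payoffs and the epsilon value here are made up for illustration):

    import random

    # Hypothetical two-armed bandit: arm 1 pays off more often on average.
    TRUE_MEANS = [0.3, 0.7]

    def pull(arm):
        """Return reward 1 with probability TRUE_MEANS[arm], else 0."""
        return 1.0 if random.random() < TRUE_MEANS[arm] else 0.0

    values = [0.0, 0.0]   # estimated value of each action
    counts = [0, 0]
    epsilon = 0.1         # probability of exploring a random arm

    for t in range(1000):
        # Explore with probability epsilon, otherwise exploit the current best estimate.
        if random.random() < epsilon:
            arm = random.randrange(2)
        else:
            arm = values.index(max(values))
        reward = pull(arm)
        counts[arm] += 1
        # Incremental sample-average update toward the observed reward.
        values[arm] += (reward - values[arm]) / counts[arm]

    print(values)  # estimates should approach TRUE_MEANS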

Reinforcement learning can be divided into model-free methods, which rely on direct stimulus-response associations (i.e., a cached value for each action), and model-based methods, which leverage an internal representation of the task structure.
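
A rough sketch of the contrast, assuming a toy setting with two actions and deterministic transitions (all names and constants here are illustrative): the model-free learner caches one value per state-action pair and nudges it after each transition, while the model-based learner stores the transitions themselves and re-plans over them.

    gamma, alpha = 0.9, 0.1
    actions = ("a0", "a1")

    # Model-free: a cached value per (state, action), updated toward the
    # reward plus the best cached value of the next state (Q-learning style).
    Q_mf = {}

    def model_free_update(s, a, r, s_next):
        best_next = max(Q_mf.get((s_next, b), 0.0) for b in actions)
        old = Q_mf.get((s, a), 0.0)
        Q_mf[(s, a)] = old + alpha * (r + gamma * best_next - old)

    # Model-based: store the transition and reward themselves, then
    # recompute values by planning (value iteration) over the learned model.
    T, R, Q_mb = {}, {}, {}

    def model_based_update(s, a, r, s_next):
        T[(s, a)], R[(s, a)] = s_next, r
        for _ in range(10):  # a few planning sweeps over known transitions
            for (si, ai), sj in T.items():
                best_next = max(Q_mb.get((sj, b), 0.0) for b in actions)
                Q_mb[(si, ai)] = R[(si, ai)] + gamma * best_next

    model_free_update("s0", "a0", 1.0, "s1")
    model_based_update("s0", "a0", 1.0, "s1")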

There are two general approaches to reinforcement learning algorithms: value-based methods, such as Monte Carlo and temporal-difference (TD) learning, learn a value function and then derive a policy from it, while policy-gradient methods directly learn and optimize the parameters of a policy function.
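
As an illustrative sketch of the policy-gradient idea, REINFORCE (without a baseline) on the same kind of hypothetical two-armed bandit; here the policy itself is parameterized (a softmax over per-action preferences) rather than derived from a value function:

    import math
    import random

    TRUE_MEANS = [0.3, 0.7]  # hypothetical payoff probabilities

    def pull(arm):
        return 1.0 if random.random() < TRUE_MEANS[arm] else 0.0

    def softmax(prefs):
        exps = [math.exp(p) for p in prefs]
        total = sum(exps)
        return [e / total for e in exps]

    theta = [0.0, 0.0]  # policy parameters: one preference per action
    alpha = 0.1

    for t in range(2000):
        probs = softmax(theta)
        arm = random.choices([0, 1], weights=probs)[0]
        reward = pull(arm)
        # REINFORCE update: theta += alpha * reward * grad log pi(arm).
        # For a softmax policy, d log pi(arm) / d theta[k] is
        # (1 - pi(arm)) when k == arm and -pi(k) otherwise.
        for k in range(2):
            grad_log = (1.0 if k == arm else 0.0) - probs[k]
            theta[k] += alpha * reward * grad_log

    print(softmax(theta))  # probability mass should concentrate on arm 1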

Related: Algorithms for simple vs. complex decision problems


Topics


Key terms

  • Reward prediction error = an “index of surprise” that reflects the difference between the value of a received reward and the value predicted at that moment in time; central to temporal-difference RL algorithms (see the worked example after this list).
  • Episodic reinforcement learning = a learning approach that keeps an explicit record of past events and uses this record directly as a point of reference when making new decisions (see: episodic memory).
  • Meta-reinforcement learning = when one learning system progressively adjusts the operation of a second learning system, improving the latter’s speed and efficiency (see: meta-learning).
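
A worked one-step example of the reward prediction error as the TD error delta = r + gamma * V(s') - V(s), with toy value estimates and an assumed discount of 0.9:

    gamma, alpha = 0.9, 0.1
    V = {"s0": 0.5, "s1": 1.0}  # toy value estimates

    r, s, s_next = 0.0, "s0", "s1"
    delta = r + gamma * V[s_next] - V[s]  # 0 + 0.9*1.0 - 0.5 = 0.4 (better than predicted)
    V[s] += alpha * delta                 # move the prediction toward the outcome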

Notes

  • Interestingly, reinforcement learning is one of the only domains in cognitive science with an account at every one of Marr’s levels of explanation.
    • Computational/knowledge – maximizing reward.
    • Algorithmic – temporal difference learning for a value function.
    • Implementational/physical – dopamine neurons encode the reward prediction error signal needed to update association values.
  • Main algorithms
    • Value-based prediction: Monte Carlo and TD Learning
    • Value-based control: Monte Carlo control, Sarsa, Q-learning (see the Q-learning sketch after this list)
    • n-step bootstrapping and eligibility traces
    • Function approximation: deep Q-networks
    • Policy gradient methods: REINFORCE, Actor-Critic
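
A minimal Q-learning sketch for value-based control, using a hypothetical 4-state chain where reaching the rightmost state earns the only reward and ends the episode (all constants are illustrative):

    import random

    N_STATES, ACTIONS = 4, ("left", "right")

    def step(s, a):
        """Deterministic chain dynamics; reaching the last state is terminal."""
        s_next = min(s + 1, N_STATES - 1) if a == "right" else max(s - 1, 0)
        reward = 1.0 if s_next == N_STATES - 1 else 0.0
        return s_next, reward, s_next == N_STATES - 1

    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    alpha, gamma, epsilon = 0.1, 0.9, 0.1

    for episode in range(500):
        s, done = 0, False
        while not done:
            # Epsilon-greedy behavior policy.
            if random.random() < epsilon:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: Q[(s, b)])
            s_next, r, done = step(s, a)
            # Off-policy target: best value available in the next state.
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next

    # Greedy policy after learning: every non-terminal state should prefer "right".
    print({s: max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N_STATES)})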