Overview
Reinforcement learning (RL) is a method of learning from rewards and punishments rather than from explicit instructions. Specifically, the agent must learn a behavioral policy, or mapping from states to actions, that maximizes cumulative long-term reward. The agent explores and exploits its environment through actions and learns from the resulting rewards.
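A minimal sketch of this interaction loop in Python; the toy chain environment and the purely random agent are illustrative assumptions, not a standard benchmark:

```python
import random

# Toy "chain" environment: states 0..4, the agent moves left or right,
# and reward is delivered only at the rightmost state. Purely illustrative.
class ChainEnv:
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.n_states - 1, self.state + move))
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        done = self.state == self.n_states - 1
        return self.state, reward, done

env = ChainEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice([0, 1])          # the agent selects an action
    state, reward, done = env.step(action)  # the environment returns a new state and a reward
    total_reward += reward                  # reward is the only feedback the agent receives
print("episode return:", total_reward)
```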
Reinforcement learning can be divided into model-free learning, which relies on direct stimulus-response associations (i.e., a value associated with each action), and model-based learning, which leverages an internal representation of the task structure.
There are two general approaches to reinforcement learning algorithms: value-based methods, like Monte Carlo and temporal difference learning, attempt to learn a value function and then derive a policy, while policy-gradient methods directly learn and optimize the parameters of a policy function.
Related: Algorithms for simple vs. complex decision problems
Topics
- Bandits, exploration, and exploitation (see the bandit sketch after this list)
- Markov decision processes and dynamic programming
- Temporal difference learning
- Intrinsically motivated reinforcement learning
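To illustrate the exploration-exploitation trade-off from the bandits topic above, here is a minimal epsilon-greedy bandit sketch; the arm reward probabilities and the value of epsilon are made-up numbers:

```python
import random

# Epsilon-greedy on a 3-armed Bernoulli bandit; the true arm means are made up.
true_means = [0.2, 0.5, 0.8]           # expected reward of each arm (unknown to the agent)
q_estimates = [0.0] * len(true_means)  # the agent's running value estimates
counts = [0] * len(true_means)
epsilon = 0.1                          # probability of exploring a random arm

for t in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_means))                          # explore
    else:
        arm = max(range(len(true_means)), key=lambda a: q_estimates[a])  # exploit
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    # incremental sample-average update of the chosen arm's value estimate
    q_estimates[arm] += (reward - q_estimates[arm]) / counts[arm]

print(q_estimates)  # estimates should approach the true means, with most pulls on the best arm
```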
Key terms
- Reward prediction error = an “index of surprise” that reflects the difference in value between a received reward and a predicted reward at each moment in time; important for temporal-difference RL algorithms (see the sketch after this list).
- Episodic reinforcement learning = a learning approach which keeps an explicit record of past events, and uses this record directly as a point of reference in making new decisions (see: episodic memory).
- Meta-reinforcement learning = when one learning system progressively adjusts the operation of a second learning system, improving the latter’s speed and efficiency (see: meta-learning).
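A minimal sketch of the reward prediction error as it appears in temporal-difference learning; the value table, discount factor, learning rate, and sampled transition are arbitrary illustrative values:

```python
# Reward prediction error (TD error): delta = r + gamma * V(s_next) - V(s)
gamma = 0.9                      # discount factor
alpha = 0.1                      # learning rate
V = {"s0": 0.5, "s1": 1.0}       # current value predictions (arbitrary)
s, r, s_next = "s0", 0.0, "s1"   # one observed transition (arbitrary)

delta = r + gamma * V[s_next] - V[s]  # positive if the outcome was better than predicted
V[s] += alpha * delta                 # nudge the prediction toward the observed target
print("prediction error:", delta, "updated V(s0):", V["s0"])
```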
Notes
- Interestingly, reinforcement learning is one of the few domains in cognitive science where all levels of explanation are understood:
- Computational/knowledge – maximizing reward.
- Algorithmic – temporal difference learning for a value function.
- Implementational/physical – dopamine neurons encode error signals needed for updating association values.
- Main algorithms
- Value-based prediction: Monte Carlo and TD Learning
- Value-based control: Monte Carlo control, Sarsa, Q-learning (see the Q-learning sketch after this list)
- N-step bootstrapping and eligibility traces
- Function approximation: deep Q-networks
- Policy gradient methods: REINFORCE, Actor-Critic
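A minimal sketch of value-based control with tabular Q-learning on a tiny chain MDP; the environment, hyperparameters, and episode count are illustrative assumptions:

```python
import random

# Tabular Q-learning on a 5-state chain: move left/right, reward only at the right end.
N_STATES, GOAL = 5, 4
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]; action 0 = left, 1 = right

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection: mostly exploit, occasionally explore
        if random.random() < epsilon:
            action = random.randrange(2)
        else:
            action = 0 if Q[state][0] > Q[state][1] else 1
        next_state, reward, done = step(state, action)
        # off-policy TD target: bootstrap from the best action in the next state
        target = reward if done else reward + gamma * max(Q[next_state])
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

print(Q)  # Q-values should increase toward the goal end of the chain
```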