Overview
Reinforcement learning (RL) is a method of learning from rewards and punishments rather than from explicit instructions. Specifically, the agent must learn a behavioral policy, or mapping from states to actions, that maximizes cumulative long-term reward. The agent explores and exploits its environment through actions and learns from the resulting rewards.
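A minimal sketch of this interaction loop in Python; the toy chain environment and the purely random agent are illustrative assumptions, not a standard benchmark:

```python
import random

# Toy "chain" environment: states 0..4, the agent moves left or right,
# and reward is delivered only at the rightmost state. Purely illustrative.
class ChainEnv:
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.n_states - 1, self.state + move))
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        done = self.state == self.n_states - 1
        return self.state, reward, done

env = ChainEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice([0, 1])          # the agent selects an action
    state, reward, done = env.step(action)  # the environment returns a new state and a reward
    total_reward += reward                  # reward is the only feedback the agent receives
print("episode return:", total_reward)
```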
Reinforcement learning can be divided into model-free learning, which relies on direct stimulus-response associations (i.e., a value associated with each action), and model-based learning, which leverages an internal representation of the task structure.
There are two general approaches to reinforcement learning algorithms: value-based methods, like Monte Carlo and temporal difference learning, attempt to learn a value function and then derive a policy, while policy-gradient methods directly learn and optimize the parameters of a policy function.
Related: Algorithms for simple vs. complex decision problems
Topics
- Bandits, exploration, and exploitation (see the bandit sketch after this list)
- Markov decision processes and dynamic programming
- Temporal difference learning
- Intrinsically motivated reinforcement learning
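To illustrate the exploration-exploitation trade-off from the bandits topic above, here is a minimal epsilon-greedy bandit sketch; the arm reward probabilities and the value of epsilon are made-up numbers:

```python
import random

# Epsilon-greedy on a 3-armed Bernoulli bandit; the true arm means are made up.
true_means = [0.2, 0.5, 0.8]           # expected reward of each arm (unknown to the agent)
q_estimates = [0.0] * len(true_means)  # the agent's running value estimates
counts = [0] * len(true_means)
epsilon = 0.1                          # probability of exploring a random arm

for t in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_means))                          # explore
    else:
        arm = max(range(len(true_means)), key=lambda a: q_estimates[a])  # exploit
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    # incremental sample-average update of the chosen arm's value estimate
    q_estimates[arm] += (reward - q_estimates[arm]) / counts[arm]

print(q_estimates)  # estimates should approach the true means, with most pulls on the best arm
```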
Key terms
- Reward prediction error = an “index of surprise” that reflects the difference in value between a received reward and a predicted reward at each moment in time; important for temporal-difference RL algorithms (see the sketch after this list).
- Episodic reinforcement learning = a learning approach which keeps an explicit record of past events, and uses this record directly as a point of reference in making new decisions (see: episodic memory).
- Meta-reinforcement learning = when one learning system progressively adjusts the operation of a second learning system, improving the latter’s speed and efficiency (see: meta-learning).
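A minimal sketch of the reward prediction error as it appears in temporal-difference learning; the value table, discount factor, learning rate, and sampled transition are arbitrary illustrative values:

```python
# Reward prediction error (TD error): delta = r + gamma * V(s_next) - V(s)
gamma = 0.9                      # discount factor
alpha = 0.1                      # learning rate
V = {"s0": 0.5, "s1": 1.0}       # current value predictions (arbitrary)
s, r, s_next = "s0", 0.0, "s1"   # one observed transition (arbitrary)

delta = r + gamma * V[s_next] - V[s]  # positive if the outcome was better than predicted
V[s] += alpha * delta                 # nudge the prediction toward the observed target
print("prediction error:", delta, "updated V(s0):", V["s0"])
```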
Notes
- Interestingly, reinforcement learning is one of the few domains in cognitive science where all levels of explanation are understood:
- Computational/knowledge – maximizing reward.
- Algorithmic – temporal difference learning for a value function.
- Implementational/physical – dopamine neurons encode error signals needed for updating association values.
- Main algorithms
- Value-based prediction: Monte Carlo and TD Learning
- Value-based control: Monte Carlo control, Sarsa, Q-learning (see the Q-learning sketch after this list)
- N-step bootstrapping and eligibility traces
- Function approximation: deep Q-networks
- Policy gradient methods: REINFORCE, Actor-Critic
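A minimal sketch of value-based control with tabular Q-learning on a tiny chain MDP; the environment, hyperparameters, and episode count are illustrative assumptions:

```python
import random

# Tabular Q-learning on a 5-state chain: move left/right, reward only at the right end.
N_STATES, GOAL = 5, 4
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]; action 0 = left, 1 = right

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection: mostly exploit, occasionally explore
        if random.random() < epsilon:
            action = random.randrange(2)
        else:
            action = 0 if Q[state][0] > Q[state][1] else 1
        next_state, reward, done = step(state, action)
        # off-policy TD target: bootstrap from the best action in the next state
        target = reward if done else reward + gamma * max(Q[next_state])
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

print(Q)  # Q-values should increase toward the goal end of the chain
```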